Methods of providing network graphical representation of database records

ABSTRACT

Methods of providing network graphical representation of database records. Selecting the database records according to descriptive criteria. Identifying attributes of the record class and associating network nodes to instances of the attributes from the database records. Connecting the network nodes with network links that designate network nodes having common instances of the attributes.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 14/982,151 filed Dec. 29, 2015, which is a continuation of U.S. patent application Ser. No. 12/578,253 filed Oct. 13, 2009, which is a Divisional of U.S. patent application Ser. No. 11/120,423 filed May 3, 2005 (now U.S. Pat. No. 7,672,950 issued Mar. 2, 2010), which claims the benefit of U.S. Provisional Patent Application No. 60/567,997 filed May 4, 2004, the entire contents of which are incorporated herein by reference in their entireties.

FIELD OF THE INVENTION

The present application generally relates to the field of data mining and analysis. More particularly, it relates to methods and systems for presenting related database records in a network graphical representation.

BACKGROUND

“Information age” and “knowledge economy” are just two of the terms commonly used to describe the explosion of digital information that characterizes our era. Whatever you call it, there is no question that the volume of information that is created is growing at unprecedented rates. Numerous attempts have been made to quantify the rate of new knowledge development and have produced various estimates of its exponential growth. A few examples of the kind of statistics often cited are:

-   Total human knowledge generally doubles every 5-10 years
-   Scientific knowledge generally doubles every 3-5 years
-   Medical knowledge generally doubles every 2-8 years
-   Number of US patents issued has about doubled in the last 7 years
-   Approximately 1.5 million pages added to the web each day
-   Worldwide production of original content stored digitally in 1999 would take about 635 thousand to 2.1 million terabytes to store

Regardless of the reliability of these estimates, they all point to an undeniable explosion in new information. Computer technology has made it easier to create and store new information. Both the number and size of the databases used to store this information are growing exponentially.

Despite the rapid growth of available information, human mental capabilities to assimilate and comprehend information have not significantly improved. The explosion of available information and our inability to assimilate it leads to information overload. The vast stores of information make it increasingly difficult to find the right information and even more difficult to make sense of the vast amount of new knowledge that is available.

Workers in the knowledge economy operate in an environment in which they are awash in information but are unable to distill insights. These workers often need to find and understand information related to a specific topic or area of interest so that they can improve their performance and/or decision-making. However, despite the availability of information that could inform and improve their decision-making, there is no practical way to find or assimilate it.

Enormous investments by numerous companies have been made to help information workers find the information “needle” they are seeking in the vast “haystack” of data in which they are searching. The dominant paradigm for information retrieval can be referred to as “Search and Sift”. The “Search and Sift” method invariably begins with a Boolean search that returns a large number of matching search results. The searcher then sifts through the results to find the information they are seeking. Internet users and users of other large databases will be very familiar with this method.

The majority of the investment in the field of information retrieval has been focused on improving the “Search and Sift” process. Examples of improvements include:

-   Query refinement: Query refinement attempts to determine the intent behind the searcher's query and refine the query in order to capture more of the documents that are relevant to the search or to exclude more irrelevant documents from the result set. An example of query refinement is “synonym expansion”, in which the query terms are augmented to include synonyms of the search terms in the hope of capturing more relevant documents.
-   Result ranking: A second means of improving the “search and sift” method is result ranking. Result ranking attempts to order the search results based on their relevance to the searcher's intent. Relevance rankings have been estimated by various means, including frequency of use of search terms, location of search terms within the document, and perceived “importance/usefulness” of the documents in the result set. Perhaps the best example of result ranking is Google's page rank metric, which is based on the number of other web pages that link to the search result page.
-   Result filtering: A final example of means to improve the “search and sift” method is result filtering. Result filtering attempts to classify the documents in the result set based on some classification scheme. The hope is that this will allow the searcher to narrow down his/her “sifting” to a subset of the result set that is most closely related to the area of interest. Examples of result filtering include Northern Light's “results folders” (see, e.g., FIG. 1), which are based on a fixed taxonomy of document classifications; Vivisimo's document clustering tool, which classifies documents into a hierarchical tree structure (see, e.g., FIG. 2) based on the semantic content of the documents; and Grokker, which classifies documents into a dynamic hierarchical structure similar to Vivisimo, but also provides a visual display of the relative size of each classification using its “bubble display” (see, e.g., FIG. 3).

All of these methods are useful improvements on the “search and sift” method; however, they all presume a specific type of information need, namely that the searcher is looking for a specific PIECE of information, and that the information being sought can be found WITHIN the documents in the result set. This kind of information retrieval is aimed at finding answers to questions such as:

-   Who killed Bobby Kennedy?
-   What is the world's second tallest mountain?
-   What is the weather forecast for Palo Alto, Calif. tomorrow?
-   What is IBM's current stock price?

While the embodiments described herein represent further improvement on the “search and sift” method, their primary contributions are aimed at meeting a different kind of information need. The primary purpose of these embodiments is to assist information users in making sense of search results, or large document sets, by providing a means for assimilating the patterns of information AMONG the documents in the set. This kind of information is referred to herein as “metadata” because it represents higher level information than is contained in any particular document or record in the database or search result. This kind of information retrieval is aimed at answering questions such as:

-   How many documents are related to my area of interest, and how quickly is this number growing?
-   Who are the main authors of information about this topic?
-   What companies are producing information on this topic?
-   What is the relationship among companies/authors that are working in this domain?

The described embodiments utilize advanced visualization techniques to reveal the metadata associated with a set of documents or a search result. In order to understand the novel contributions of the present invention, it is useful to review other systems and techniques in this field, in particular within two areas of study: 1) existing methods of presenting metadata, and 2) visualization methodologies used for understanding large data sets.

Existing Methods of Presenting Metadata

Previous efforts to analyze and present metadata related to large data sets can be divided into a number of categories. A brief description of each and examples of the existing state of the art are provided below for the purpose of differentiating the present invention.

Statistical Analysis

One of the simplest and most widely used means of analyzing sets of documents is statistical analysis. Statistical analysis can be as simple as calculating the number of documents in the set by date, author/inventor, author/inventor affiliation, country, classification, or other attribute. It may also include calculation of statistics relevant to the particular type of data being examined. For instance, in the patent data domain, statistics like number of citations, citations/patent/year, time from filing to grant, age of most recent citation, age of most recent academic citation, and other statistics are sometimes calculated. These statistical methods are employed widely, and are in some instances automated in commercial applications such as those offered by Delphion, Micropatent and CHI Research in the patent space and many others in other domains.
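By way of illustration only, the kind of per-attribute counting and summary statistic described above might be sketched as follows. The record fields (`assignee`, `filed`, `granted`) and the helper names `count_by` and `mean_years_to_grant` are assumptions made for this sketch; they are not taken from any of the commercial tools named.

```python
from collections import Counter

# Hypothetical records; the field names are illustrative only.
records = [
    {"id": "P1", "assignee": "Company A", "filed": 2001, "granted": 2003},
    {"id": "P2", "assignee": "Company B", "filed": 2002, "granted": 2005},
    {"id": "P3", "assignee": "Company A", "filed": 2003, "granted": 2004},
]


def count_by(records, attribute):
    """Count how many records carry each value of one attribute."""
    return Counter(r[attribute] for r in records)


def mean_years_to_grant(records):
    """Average time from filing to grant across the set, in years."""
    spans = [r["granted"] - r["filed"] for r in records]
    return sum(spans) / len(spans)


counts = count_by(records, "assignee")
average = mean_years_to_grant(records)
```

Such output lends itself to a textual report or a simple bar chart, but it carries no information about how the records relate to one another.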

Statistical analysis can provide some useful insight into the set of documents under evaluation, but is clearly limited as to the amount of insight that can be obtained. The best-known tools of this type provide textual reports or simple bar charts showing the number of documents with each attribute value (e.g. How many documents by Company A, Company B, Company C, etc.) or the statistics associated with the overall document set (e.g. Average time from filing to grant). They do not provide information about how the various documents are related to each other, and they do not provide a means for interacting with the metadata in a way that allows the user to explore what the various attributes of the documents reveal about the overall document set. It is an objective of one or more embodiments of the present invention to provide a means for users to understand the relationships among groups of documents and to provide a means for deep exploration into the metadata associated with the document set or search result.

Clustering

Another method used for revealing metadata about large sets of documents is clustering. Various tools have been developed that group documents into clusters. Some of these tools separate documents into clusters based on a fixed taxonomy of categories, while others utilize syntactic information within the documents to cluster them into a dynamic set of categories. Two examples of fixed taxonomy clustering tools are the Northern Light search engine and The Brain's <thebrain.com> web search tool. The fixed taxonomy clustering method is accomplished in one of two ways. First, categories may be based on explicit attributes of the documents. For instance, Internet search results can be divided into categories based on their domain extensions such as “.com”, “.net”, “.edu”, or their country domain such as “.sp”, “.ge”, “.jp”, etc. Secondly, categories may be based on a taxonomy into which documents in the data repository have previously been assigned. This is generally accomplished by manually reviewing documents or the domains under which those documents fall and assigning them to one or more categories within the fixed taxonomy.

A second method of clustering documents or search results is based on the creation of a dynamic taxonomy. These clustering techniques use syntactic data within the documents and then cluster the document set into smaller groups and “name” those groups based on the words or phrases they have in common. The clustering method essentially creates an automated classification schema that can provide insight into the nature of the documents in the set. This technique has been applied to a wide variety of document types and various commercial software applications are available which perform this function. Examples of the use of clustering techniques within the domain of patents include the Vivisimo and Themescape tools <micropat.com/static/advanced.htm> that are incorporated into Micropatent's Aureka <micropat.com/static/index.htm> tool set and the Text Clustering tools <delphion.com/products/research/products-cluster> available in Delphion's tool set. Vivisimo's tools can be configured to operate on any set of text documents, as can the semantic analysis tools developed by Inxight <inxight.com/products/smartdiscovery>.

Using these clustering tools, basic metadata about a document set or a search result can be presented. The methods employed by the above referenced tools can automatically display the number of documents in the set or search result that fall into each category, making it possible to more quickly “sift” through the results to find the piece of information that is being sought. They also provide some valuable information about the contents of the document set or search result.

The value of the best known clustering tools is limited in two important ways. First, the metadata provided about the contents of the document set is only as good as the taxonomy into which it is clustered. This is an inherent limitation of both fixed and dynamic taxonomy clustering techniques.

Fixed taxonomies are limited in their usefulness by a number of factors:

-   The taxonomy is based on the priorities of its creator, not the searcher. The creation of a taxonomy entails making choices about what attributes of the information are most important. For example, the first branches in a taxonomy of bird types could be established in multiple possible ways: migratory versus non-migratory, waterfowl versus landfowl, etc. Often, the priorities of the taxonomer are not aligned with the needs of the information user, thus limiting the value of the clustering metadata provided.
-   Fixed taxonomies cannot easily be adjusted as the contents of the database evolve. Once a taxonomy has been established and users have begun using it, it becomes rigid and difficult to change. As the contents evolve, there is inevitably a need to add new categories, sub-divide categories, and recombine categories. This makes it difficult to compare results over time. As an example, consider the taxonomy of technologies created by the WIPO known as the International Patent Classification system (IPC). The IPC is now in its seventh edition. In each edition, classes were added, moved, sub-divided and eliminated. However, the millions of patent documents that were filed prior to the revision remain classified under the original classification schema that existed at the time they were granted. This makes the presentation of clustering metadata problematic when based on a fixed taxonomy.
-   Another issue related to fixed taxonomies is that the documents in the data set typically do not fall into a single classification. This creates a classification problem that has typically been solved by assigning the documents into multiple categories within the taxonomy. This multiple-assignment creates a challenge for how to display the clustered results when many documents fall into multiple categories. The typical solutions are to count each document only within a single (primary) classification, or to count the document multiple times, once for each category of classification. Both solutions have problems. The first ignores important information about secondary classifications, and the second represents multiple instances of each document.
-   The other major limitation of fixed taxonomies is the difficulty in assigning documents to the categories. Typically, this is a manual process that is done either by the author of the document or by a specially trained person or persons who take responsibility for classification. Once again, both options have problems. Author classification suffers from a lack of consistency, while centralized classification is extraordinarily time consuming when large numbers of documents must be classified.

Dynamic taxonomies have been created in order to overcome some of the limitations of fixed taxonomies. However, they have limitations of their own which diminish their usefulness in providing metadata about a large document set. Some of the challenges associated with dynamic taxonomies are described below:

-   All dynamic taxonomy systems known by the inventors are based on semantic data. Simply put, the classification of documents is based on the similarity of the words contained in the documents. The problem with this is that all languages are extremely imprecise when it comes to expressing ideas. Any classification of documents based on semantic similarity will suffer from both synonymy (multiple words expressing the same meaning) and polysemy (words having multiple meanings). Although there is certainly value in syntactic clustering, the experience of the inventors shows that the clusters created are suggestive of the contents, but far from precise.
-   A second linguistic issue associated with semantic clustering is multiple languages. Semantic clustering tools completely fail when documents of different languages are included in the data set. As the trend toward globalization continues, this problem will continue to increase in importance. Some attempts have been made to use multilingual thesauri to allow linguistic comparison of multilingual document sets, but this research is still in its infancy.
-   A final limitation of dynamic taxonomies is the lack of comparability between clusters from one document set or search result and another. Because the taxonomy is created specifically for the document set, no two taxonomies created for different document sets or different search results can be compared.
-   Dynamic taxonomies also suffer from the multiple classification problem described above.

The second limitation of the clustering technique is that any taxonomy only describes the document set or search result in relation to a single attribute. Most taxonomies are meant to describe the topics or themes of the documents they categorize. While this information is useful, there is no system known by the inventors that allows users to simultaneously make use of clustering information as well as the variety of other available sources of metadata that describes the document set or search result. It is an objective of one or more embodiments of the present invention to provide users with a way to iteratively or simultaneously make use of the information contained in both fixed and dynamic taxonomies as well as a wide variety of other metadata sources in order to provide a deep level of insight about the document set or search result that meets the specific information needs of the user.

Visualization Methodologies Used for Understanding Large Data Sets

The most advanced methods of obtaining insight into the metadata related to large document sets or search results are the visualization techniques. The field of data visualization has progressed rapidly over the last several years as computer processors have become powerful enough to perform the many millions of calculations required to display complex data relationships. A number of data visualization tools are relevant to consider with respect to the present invention. These can be divided into several categories which will be described below. Relevant examples will also be provided for each.

Hierarchical displays: One visualization method which has been employed is the hierarchical display. In its simplest form, documents or search results are represented in the form of a tree structure similar to the directory structure which is a well known metaphor for displaying categorized data. One example of a hierarchical display designed to reveal metadata is Vivisimo's clustering tool described above. Because of the difficulty in displaying and comprehending a large hierarchical structure, several alternative methods have been developed to display these hierarchies. One example is the fisheye lens, which is used to display large hierarchies of patent citations within Micropatent's Aureka tool set. The fisheye display allows users to zoom in on a portion of the hierarchy while still comprehending their position within the overall hierarchy.

Another sophisticated example of a hierarchical display is the Grokker tool developed by Grokis Corporation and described in U.S. Pat. No. 6,879,332B2. Much like the Vivisimo tools, the Grokker tool clusters documents in a hierarchical structure based on a semantic algorithm. Unlike Vivisimo, the Grokker tool presents information to users in a stylized marimekko diagram. The Grokker visualization represents the document set in a two dimensional space with each cluster of documents sized based on the number of documents in the cluster. The space on the screen represents the overall search result. Within this space, clusters of documents are displayed (represented by circles or squares) and labeled based on a common word found within those documents. Within each cluster are further “sub-clusters”, again represented visually and labeled with a keyword. The hierarchy descends until finally the documents themselves are found at the lowest level of the hierarchy.

Each of these leading examples of hierarchical data visualization is based on latent semantic information contained within the documents and as such, suffers from the limitations of semantic analysis as described above in the section describing fixed and dynamic taxonomies.

Spatial visualizations: A second type of visualization used to reveal meta-data within a large document set is the spatial visualization. Spatial visualization uses a map metaphor to arrange document records in a two or three-dimensional space. Although the various spatial visualization tools differ somewhat, those known to the inventors follow a similar methodology for creating a map. This method entails four steps:

1)  Calculate a semantic vector for each document: for each document in the dataset, calculate a vector to represent the semantic content of the document (typically based on a histogram of word or concept usage).
2)  Create a similarity matrix: using the semantic vectors for each document, calculate a similarity metric for each document pair and thereby create a document similarity matrix.
3)  Create a two or three dimensional projection based on the similarity matrix: using principal component analysis or a similar method (e.g. multidimensional scaling), calculate locations for each document in the set such that the distance between documents best reflects the similarity between documents as described by the similarity matrix.
4)  Draw a visualization of the information space: using the two or three dimensional projection, plot the documents as points within a document space.
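As a minimal sketch of the first two steps only, the following Python fragment builds a word histogram for each document and a pairwise similarity matrix. Cosine similarity is assumed here as the similarity metric, and the function names are hypothetical; the tools discussed in this section may use different vectors and metrics. The projection and plotting steps are not shown.

```python
import math
from collections import Counter


def semantic_vector(text):
    """Step 1: represent a document as a histogram of lower-cased words."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two word histograms (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_matrix(documents):
    """Step 2: compute a similarity value for every document pair."""
    vectors = [semantic_vector(d) for d in documents]
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


docs = [
    "network graph of nodes",
    "graph of linked nodes",
    "weather forecast tomorrow",
]
matrix = similarity_matrix(docs)
```

Because the histogram is built from raw words, two documents that express the same idea in different vocabulary score as dissimilar, which is the synonymy limitation noted earlier for semantic methods.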

Some spatial visualization tools take a further step of adding a topographical overlay to the information space to reveal the degree of clustering. Some may even identify and label clustered groups based on words that are common within the cluster.

An example of a spatial visualization tool is the Themescape map, which is part of the patent analysis toolkit developed by Aurigin Systems and is now part of the offering provided by its acquirer The Thomson Corporation through its subsidiary Micropatent. The Themescape visualization tool uses semantic analysis of patent titles, abstracts or full text (at the user's discretion) to create a two dimensional projection of the information space based on the method described above. As is shown in FIG. 4, Themescape uses a map metaphor and overlays a topography over the information space with mountains representing the most highly clustered portions of the information space. Users of the Themescape map can explore the terrain by searching the information space for company names and other keywords or by selecting document clusters to read or export back into a document list for further review or analysis.

The underlying technology for the Themescape tool came from research performed at the Pacific Northwest National Laboratory, which also has a spatial visualization tool known as SPIRE (Spatial Paradigm for Information Retrieval and Exploration). As is shown in FIG. 5, SPIRE has two visualization analogies. One, the “Starfield”, shows a plot of documents in three dimensions in a view that looks very much like a starry sky. The second, the “Theme view”, is a topographical metaphor very similar to the implementation in Aurigin's Themescape map.

While quite useful in developing a general understanding of the information contained in a large dataset, the spatial visualization tools known to the inventors base their visualization solely on latent semantic information contained within the documents and as such, suffer from the limitations of semantic analysis as described above in the section describing dynamic taxonomies.

Network visualization: The final visualization technique that is sometimes applied to increase understanding of the meta-data associated with large data sets is network visualization. In its simplest form, a network diagram (mathematicians would call this a graph) is simply a set of nodes (typically represented as dots) connected by links (also known as edges or ties). Network graphs are not new; some network concepts date back at least to the ancient Greeks. Social network analysis developed significantly in the 1930s. The development of modern computers with powerful processors has made it possible to create computerized network visualization tools.

The network paradigm is a very valuable method to apply to analysis of large data sets. There are two specific reasons why the network lens is so valuable. First, most visualization tools are designed to draw attention to the entity being analyzed (typically a document, a person or an institution). While network visualizations display information about individual entities as well, they also place significant emphasis on the relationships between and among those entities. The network display shows not just the entities, but the system in which those entities operate. In recent years, various scientific and academic researchers have come to the realization that reductionist analysis (i.e. analysis that focuses on breaking a problem down into its component parts and thoroughly analyzing each component) is limited. Fields like biology, genetics, ecology, sociology, physics, astronomy, information science and many others have all seen advances based on systems analysis. Systems analysis focuses not on the smallest elements (e.g. genes, atoms or perhaps quarks, and bits), but on the interactions between and among those elements. The network tool is by its nature a systems visualization tool. It therefore can lead to entirely different kinds of insight and conclusions than can the other visualization tools within the prior art.

A second reason that network visualization tools are appropriate for analyzing large data sets is that networks have the potential to view the same set of information from a variety of viewpoints. Prior art network visualization systems do not take significant advantage of this fact, but networks have the potential to be transformed from one perspective to another, with each perspective providing a different insight about the data being analyzed. The description of the Network Visualization System below will describe how this can be accomplished in order to dramatically improve the insight that can be gained about large and complex datasets.

First, however, it is necessary to understand the present state of the art in network visualization and to identify some of the key limitations of the existing tools. A variety of computerized network visualization tools exist, including the following:

-   aiSee <aisee.com>
-   Cyram NetMiner <netminer.com>
-   GraphVis <graphvis.org>
-   IKNOW <spcomm.uiuc.edu/projects/TECLAB/IKNOW/index.html>
-   InFlow <orgnet.com/inflow3.html>
-   Krackplot <andrew.cmu.edu/user/krack/krackplot/krackindex.html>
-   Otter <caida.org/tools/visualization/otted>
-   Pajek <vlado.fmf.uni-lj.si/pub/networks/pajek/>
-   UCINET & NetDraw <analytictech.com>
-   Visone <visone.de/>

Each one of these tools is capable of creating a network graph. The more advanced packages (e.g. UCINET/NetDraw, NetMiner) provide a range of visualization capabilities such as:

-   Choosing alternative layout algorithms
-   Displaying multiple node types
-   Sizing/coloring/selecting shape of nodes based on the value of an attribute
-   Displaying multiple link types
-   Sizing/coloring/selecting line type of links based on the type of link

All of these tools are general-purpose network visualization tools. In other words, they are designed to display network graphs of any data that is structured in such a way that both the nodes and links of the network are defined. Each of these tools uses a particular (and often unique) file format to capture information about nodes and node attributes, and links and link attributes. Node information is captured through a node list where each node is represented by a node record. Node records contain at least one field which is a unique identifier for that node, but can also contain other attribute fields that provide information about the node. Link information is captured through a link list (or link matrix) which at a minimum identifies which two nodes are linked, but may also capture information like link strength, link direction, and link type.
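The node list and link list described above can be pictured with a small Python sketch. The class names `NodeRecord` and `LinkRecord`, their field names, and the `validate` helper are hypothetical and do not correspond to the file format of any tool listed; they only illustrate the minimum information such formats are described as carrying.

```python
from dataclasses import dataclass, field


@dataclass
class NodeRecord:
    node_id: str                                     # unique identifier (required)
    attributes: dict = field(default_factory=dict)   # optional descriptive fields


@dataclass
class LinkRecord:
    source: str                  # node_id of one endpoint
    target: str                  # node_id of the other endpoint
    strength: float = 1.0        # optional link strength
    directed: bool = False       # optional link direction
    link_type: str = "generic"   # optional link type


def validate(nodes, links):
    """Check that node ids are unique and every link refers to known nodes."""
    ids = [n.node_id for n in nodes]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate node id")
    known = set(ids)
    for link in links:
        if link.source not in known or link.target not in known:
            raise ValueError(f"link refers to unknown node: {link}")
    return True


nodes = [
    NodeRecord("n1", {"label": "Company A"}),
    NodeRecord("n2", {"label": "Company B"}),
]
links = [LinkRecord("n1", "n2", strength=3.0, link_type="shared-inventor")]
```

In this arrangement the user must already have decided what a node is and what a link is before any such list can be written.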

Although the tools differ in their details, the process of working with them follows a common pattern as in FIG. 6. A user of any of the known prior art systems gathers data from whatever sources are to be utilized. She then chooses a definition of what entity within the data will represent nodes and what information she will use to create links between the nodes. The data must then be formatted to match the particular file structure of the network visualization tool. In all cases, this requires the user to create a list of nodes and a link-list or link-matrix. Once formatted properly, the files of network data can then be input into the network visualization system and analyzed and visualized. The user can work with the data within the tool and select different layout algorithms or display attributes, and analyze the structure of the network using any provided analytical tools.

If the user would like to develop an alternative visualization of the data using a different definition of nodes and/or links, she must start from the beginning, redefine nodes and links, reformat the data into a node and link-list and re-introduce the new files into the visualization system. The system can then display a network graph based on the new definition of nodes and links. Some of the inherent limitations of these prior art systems include the following:

-   Database records from any data source cannot be visualized because they do not contain node and link information that is usable by the system.
-   The process of accessing and formatting data is not integrated into the network visualization tool.
-   The user must format data into node/link lists to accommodate the system.
-   The user must select a stable definition of what constitutes a node and what constitutes a link prior to formatting the data for use in the system.
-   There is no way to change definitions of nodes and links while working within the network visualization system.
-   If a new node/link definition is chosen, there is no way to combine or connect the network based on the first definition with the network based on the second definition, even though both networks are based on the same underlying data.
-   There is no way to specify particularly useful node and link definitions to be used repeatedly with data from a particular source. Each time data from that source is to be visualized, the user must start from the beginning and specify each node and link definition and manipulate the data to accommodate the visualization system.

SUMMARY OF THE INVENTION

In one aspect, a method of providing a network graphical representation of two or more database records includes selecting the two or more database records according to one or more descriptive criteria. Each of the two or more database records is a member of a common record class. The method further includes identifying one or more attributes of the record class, and associating network nodes to instances of the one or more attributes from the database records. The method also includes connecting the network nodes with network links that designate network nodes having common instances of the one or more attributes.

The common record class may include patent records from a database, such as a LexisNexis database, a Thomson database, a USPTO database, an EPO database, or a Derwent database.

The common record class may include academic journal articles from a database, such as a PubMed database.

The descriptive criteria may include, for example, (i) one or more key words within a body field of each of the patent records, (ii) one or more key words within a title field of each of the patent records, (iii) one or more inventors in an inventor field of each of the patent records, (iv) one or more assignees in an assignee field of each of the patent records, (v) one or more key words within an abstract field, or combinations thereof.

The attributes may include, for example, inventor, assignee, filing date, issue date, IPC code, USPC code, or field of search.

The network links may include a characteristic that describes an amount of common instances occurring between connected nodes. The characteristic may include, for example, link thickness, link color or link texture.

The network nodes may include meta-nodes, which describe a characteristic of two or more database records.

The method may further include iteratively executing the identifying and connecting steps while modifying the one or more descriptive criteria, so as to change the selected two or more database records. The one or more descriptive criteria may include, for example, a range of dates.

The method may further include selecting additional database records from a record class other than the common record class of patent records, and associating network nodes, network links, or both, to instances of one or more attributes from the additional database records. The other record class may describe, for example, licensing history associated with the patent records, litigation history associated with the patent records or maintenance fee history associated with the patent records.

In another aspect, a method of providing a network graphical representation of two or more database records includes selecting the two or more database records according to one or more descriptive criteria. The method further includes identifying two or more common attributes of the database records, and associating network nodes to instances of a first one of the common attributes from the database records. The method also includes connecting the network nodes with network links that designate network nodes having common instances of one of the two or more common attributes, so as to form a first network graphical representation. The method further includes transforming the first network graphical representation into a second network graphical representation by associating the network nodes to instances of a second one of the common attributes from the database records, and connecting the network nodes with network links that designate network nodes having common instances of the second attribute.

In another aspect, a method of providing a network graphical representation of two or more database records includes selecting the two or more database records according to one or more descriptive criteria. The method further includes identifying two or more common attributes of the database records, associating a first set of network nodes to instances of a first one of the common attributes from the database records, and associating a second set of network nodes to instances of a second one of the common attributes from the database records. The method also includes connecting one or more members of the first set of network nodes to one or more members of the second set of network nodes with network links that designate associations between the network nodes, so as to form a first network graphical representation.

In another aspect, a method of providing a network graphical representation of two or more database records includes selecting the two or more database records according to one or more descriptive criteria. The method further includes identifying two or more common attributes of the database records, associating a first set of network nodes to instances of a first one of the common attributes from the database records, and associating a second set of network nodes to instances of a second one of the common attributes from the database records. The method also includes containing the second set of network nodes presented in a network configuration, within one or more of the first set of network nodes presented in a network configuration. Each of the second set of network nodes shares a common attribute instance with the network nodes of the first attribute within which the second set of network nodes is contained. The method may further include associating a third set of network nodes with a third one of the common attributes of the database records, and containing the third set of network nodes presented in a network configuration, within one or more of the second set of network nodes presented in a network configuration. Each of the third set of network nodes shares a common attribute instance with the network node of the second attribute within which the third set of network nodes is contained. The method may further include associating one or more additional sets of network nodes with other ones of the common attributes from the database records, and grouping one or more members of the additional sets of network nodes within other network nodes, such that each group of network node members is characterized by the attribute associated with the grouping network node.

In another aspect, a Network Visualization System includes a computer readable medium with stored instructions adapted for providing a network graphical representation of two or more database records. The stored instructions implement the steps of the one or more methods described herein.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows one prior art example of a graphical representation generated using result filtering.

FIG. 2 shows another prior art example of a graphical representation generated using result filtering.

FIG. 3 shows yet another prior art example of a graphical representation generated using result filtering.

FIG. 4 shows a prior art example of a spatial visualization tool.

FIG. 5 shows another prior art example of a spatial visualization tool.

FIG. 6 shows a common process of working with prior art visualization tools.

FIG. 7 shows how database records can be converted to link data in a described embodiment.

FIG. 8 shows a simple network diagram based on a small number of database records.

FIG. 9 shows network link representation in one described embodiment.

FIG. 10 shows a simple network of patent documents where the documents with the same assignee are clustered together.

FIG. 11 shows an example of a chronological network graph for one described embodiment.

FIG. 12 shows a network generated by one described embodiment that represents a single article from the PubMed database.

FIG. 13 shows a network generated by one described embodiment without using meta-nodes.

FIG. 14 shows a network generated by one described embodiment using meta-nodes.

FIG. 15 shows how the records of FIG. 7 are converted to meta-nodes.

FIG. 16 shows an example of a network in which the links between assignee nodes and inventor nodes are based on whether the inventor has invented on a patent held by that assignee.

FIG. 17 shows a graph with meta-nodes representing IPC codes.

FIG. 18 shows a fractal network graph produced by one described embodiment.

FIG. 19 shows a user interface produced by one described embodiment, and FIG. 19A shows the results of grouping multiple assignee names under a single assignee name that represents the group.

FIG. 20 shows the relationships between attributes of patent data and attributes of academic literature.

FIG. 21 shows a typical network based on PubMed data and visualized using the MNVS described herein.

FIGS. 22-25 show an alternative view of the network shown in FIG. 21.

FIGS. 26-27 show a network, produced by one described embodiment, of different clusters of collaboration.

FIG. 28 shows a network, produced by one described embodiment, of a search limited by geography.

FIG. 29 shows a network, produced by one described embodiment, of a search limited by organization.

FIG. 30 shows a network, produced by one described embodiment, which illustrates research synergies or substitutions across organizations.

FIG. 31 shows a network, produced by one described embodiment, which illustrates regional bases of research strength.

FIG. 32 shows a computer implementation of the described embodiments.

DETAILED DESCRIPTION

As used in the embodiments described herein, a Network Visualization System (NVS) is a system and/or method for making sense of sets of related database records or documents by providing a network graphical representation of the database records. These methods and/or systems can be applied to any database with records where a relationship can be established between and among the records. Examples of some domains in which the NVS can be applied include, but are not limited to, patent documents, academic articles/papers/journals, medical/scientific articles/papers/journals, literature, web pages, corporate databases of customers/products/suppliers/sales, corporate knowledge management databases, retail databases, government databases of census information/economic data/etc., organization databases of membership/subscribers/organizational affiliation and many others. In fact, any information that is or can be structured as a table of information with two or more fields of information can be visualized as a network using the invention.

An important insight is that database records/documents are related to each other by various attributes that can be represented as a network. The kind of attributes that can be used to create a linkage relationship among records/documents can include, but are not limited to, citation links (e.g. documents A and B are linked because document A cites document B), co-citation links (e.g. documents A and B are linked because document C cites both document A and B), bibliographic coupling (e.g. documents A and B are linked because document A and B both cite document C), common authorship, common affiliation (assignee, company, journal, etc.), common classification within some static or dynamic taxonomy, common keywords, semantic similarity, and many other possible links. These linkages make it possible to represent a set of database records or documents as a network that enables the use of various network statistics and visualization tools to help the user make sense of the selected information.

Each database record is characterized by specific instances of these attributes. For example, for an attribute of “inventorship” in patent database records, an “instance” of that attribute might be “John Smith,” i.e., a particular inventor. So, for example, two patents sharing the same instance “John Smith” of the attribute “inventor” can be considered linked, and therefore visualized as part of a network.

A detailed description of the NVS is set forth below, which describes a method for converting database information or documents into network information and then describes a method for creating multiple visualizations of the network. Further disclosed are two selected examples of applications of the NVS to specific databases: the patent database and medical journal databases. It should be understood that these are only exemplary embodiments, and those skilled in the art will understand that the specific methods described in each can be applied to the others, as well as to any other database with records/documents where a relationship between documents can be established via various linkage types as described below.

The dominant visualization paradigm in the NVS is the network. A network is a collection of objects that are in some way connected to each other. It is common to visually represent a network using a “graph”. A network graph is a visual representation in which each object in the network is represented by an icon or emblem known as a node and each connection between the objects is represented as a link (also known as an edge or tie) that visually connects the nodes. These nodes and links can be laid out in such a way as to provide a visual representation of the relationships among the various objects in the network.

Finding an appropriate way to lay out a network graph to reveal the relationships among the objects is not a trivial task. Graph theory and network layout algorithms are well-established areas of research. Various layout algorithms have been developed to create useful visual representations of a network. It is not an intention of the Network Visualization System to improve upon existing graph layout methods. In accordance with various embodiments of the NVS, any network layout method can be utilized as a way to visualize the various important attributes of a large set of related patent documents or database records. The NVS makes use of existing layout algorithms in order to display network graphs of a set of database records or related documents.

The network paradigm has been chosen as a basis for visualization because one of the key attributes to be understood about a large set of documents/database records is the relationship or relationships that exist among and between them. Network visualization by its nature is designed to reveal relationships and is therefore a very useful tool in understanding large document sets.

Acquisition of Data

The first step in utilizing the NVS is to acquire the documents/database records to be examined. Several means can be employed to access a set of document/database records for analysis. Data that is stored in an electronic data repository (either within the same computer system or in one or more remote onsite or offsite servers) can be accessed by the NVS in its entirety, or as a subset of records. This can be accomplished by electronically submitting a user query based on one or more descriptive criteria via a computer implemented or assisted search. The query can be submitted using normal, well-established Boolean syntax, and may be performed by searching for user specified terms within one or more fields of the database records or within the entire “full text” of the database record. The entirety of the data or the query result can then be analyzed and visualized by the Network Visualization System. In one embodiment, the database records within the electronic data repository are all members of a common record class, such as patents from the USPTO, EPO, Aureka, Micropatent, Thomson, Lexis Nexis or Derwent database, or academic journal articles contained in one of the many academic, scientific, engineering or medical document databases such as, for example, the PubMed database. In other embodiments, the database records are members of two or more record classes.

Data that is not stored in electronic data repositories can also be analyzed by the NVS; however, the data must first be converted into an electronic format through data entry, OCR (optical character recognition), or other appropriate techniques. Once the data is converted into electronic form, it can then be analyzed as any other database using the NVS.

Transforming Database Records into Network Data

The data extracted from the data repository as described above is merely a set of records. This data would not be considered “network” data by any known network visualization tool within the prior art. This is because it is not structured as a node list and a link list (or matrix). The data is simply a collection of records with each record having two or more fields that represent attributes of the record. For example, a corporate customer database might have fields for customer ID, name, street address, city, state, zip, country, telephone number, e-mail address, and many others. While this data is useful as a node list where each record is viewed as a node, there is no link list, so the data cannot be represented as a network graph.

The Network Visualization System converts this data into network data by creating a link list for each record-linking attribute. This is accomplished by creating a link between each pair of records that share a common attribute value. The chart shown in FIG. 7 provides a simplified example of how database records can be converted to link data.

In this example, we use a very simplified set of patent database records. For attributes like Assignee or Inventor, links are created for each pairing of records that share the same attribute value. These links do not have directionality, as they are simply based on co-occurrence of attribute values. Citation links, if provided as a list, must be parsed (in this case based on comma delimiting) and then directional links assigned between citing and cited patents as shown in the example.
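The conversion just described can be illustrated with a minimal sketch. The records, field names, and function names below are hypothetical and are not taken from FIG. 7; the sketch only assumes that multi-valued fields are lists and that citations arrive as a comma-delimited string.

```python
from itertools import combinations

# Hypothetical sample records; field names and values are illustrative only.
records = [
    {"patent": "P1", "assignee": "Acme", "inventors": ["Smith", "Jones"], "cites": ""},
    {"patent": "P2", "assignee": "Acme", "inventors": ["Lee"], "cites": "P1"},
    {"patent": "P3", "assignee": "Bolt", "inventors": ["Smith"], "cites": "P1, P2"},
]

def values_of(record, attribute):
    """Return the attribute's instances as a set (a single value becomes a one-element set)."""
    value = record[attribute]
    return set(value) if isinstance(value, list) else {value}

def cooccurrence_links(records, attribute):
    """Undirected links: one per pair of records sharing an instance of the attribute."""
    links = []
    for a, b in combinations(records, 2):
        shared = values_of(a, attribute) & values_of(b, attribute)
        if shared:
            links.append((a["patent"], b["patent"], sorted(shared)))
    return links

def citation_links(records):
    """Directed links (citing, cited), parsed from a comma-delimited citation field."""
    links = []
    for record in records:
        for cited in record["cites"].split(","):
            cited = cited.strip()
            if cited:
                links.append((record["patent"], cited))
    return links

assignee_links = cooccurrence_links(records, "assignee")
inventor_links = cooccurrence_links(records, "inventors")
cite_links = citation_links(records)
```

In this sketch the original records serve as the node list, and each attribute yields its own link list, so several link types can coexist over the same nodes.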

This method can be used to convert any set of database records into network data where the original records are the node list and the link lists are created based on common instances of one or more attributes as described above. Once the database records have been converted into network data, the NVS allows the user to visualize the network.

Basic Database Networks

In a basic database network, each record can be represented by a node and the nodes can be connected to each other by links that represent one or more of the various types of linkages described above. FIG. 8 shows a very simple network diagram based on a small number of database records.

Representing nodes—Nodes can be displayed as a basic shape (e.g., a circle, oval, square, rectangle, etc.) or an icon (e.g. a picture of a document). The color of the nodes can be changed to represent some attribute of the nodes. The node can also be labeled with text that identifies the record or displays one or more attributes of the database record. As a practical matter, long node labels tend to make the network display unwieldy. The NVS addresses this issue in several ways. First, multiple options are provided to the user for how much of the desired attribute value to display within the node. Options include: Full (the entire value), Short (the first word or first “n” characters), Point (two digit year only) and None (no label). The problem is further addressed by allowing the user to see the full node label whenever she points to or selects a node (using an electronic pointing device such as a mouse or trackball).

Representing links—Links can be represented by a line or an arrow. See, for example, FIG. 9. The display of the links can be made to reveal the direction of the linkage by attaching an arrow head, using a triangular shape with the apex pointing toward the cited document, or by using different colors or line styles (e.g. dotted, solid) to represent forward and backward citations. Furthermore, the strength of the connection between nodes can be visually depicted by varying the thickness of the line or by altering its color or style. The strength of the link can also be depicted by displaying a value associated with each link's strength close to the link on the network diagram. Various types of links between nodes can be established as described above.

As a practical matter, when multiple links connect the same two nodes, it is difficult for the user to differentiate the various ways in which the two nodes are related. The NVS resolves this problem in a number of ways. First, different link types are displayed so that they are visually different. This is accomplished by showing the different link types with different colors, line styles (e.g. dotted, solid, dashed), line thickness, etc. In addition, the multiple links between the nodes are aligned side by side so that multiple link types can be displayed without overlapping.

Another technique for resolving the problem of multiple link types is to collapse the links into a single “composite link,” and to attach icons to the link that show the different types of links and strength of the ties. FIG. 9 shows an example of how multiple link types can be collapsed into a single link with icons.

The strength of this composite link between the two nodes can be calculated in a number of ways. It can be based on simply the number of links, the sum of the strengths of the combined links, or a weighted average of the combined links. If a weighted average is used, the weighting factors can be chosen based on an estimation of the relative importance of each link type.
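As an illustration of the three calculations named above, the following sketch computes a composite strength from a set of per-type link strengths. The function name, the link-type names, and the weights are hypothetical; a link type without an explicit weight is assumed to have weight 1.0.

```python
def composite_strength(link_strengths, method="count", weights=None):
    """Combine per-type link strengths between two nodes into one value.

    link_strengths maps a link type to the strength of that link.
    """
    if not link_strengths:
        return 0.0
    if method == "count":
        # Simply the number of links being combined.
        return float(len(link_strengths))
    if method == "sum":
        # Sum of the strengths of the combined links.
        return float(sum(link_strengths.values()))
    if method == "weighted_average":
        # Weighted average; missing weights default to 1.0 (an assumption).
        weights = weights or {}
        total_weight = sum(weights.get(t, 1.0) for t in link_strengths)
        if total_weight == 0:
            return 0.0
        weighted = sum(s * weights.get(t, 1.0) for t, s in link_strengths.items())
        return weighted / total_weight
    raise ValueError(f"unknown method: {method}")

# Hypothetical links between one pair of nodes.
links = {"citation": 3, "co-inventor": 1, "same-ipc": 2}
weights = {"citation": 2.0, "co-inventor": 1.0, "same-ipc": 1.0}

by_count = composite_strength(links, "count")
by_sum = composite_strength(links, "sum")
by_average = composite_strength(links, "weighted_average", weights)
```

Here the citation link is weighted twice as heavily as the other types, reflecting one possible estimate of relative importance.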

Another feature of the NVS is that the user is provided with a means for selecting which link types are active or inactive, and which are visible or invisible. By way of definition, “active” links are those links that affect the layout of the network graph. In other words, they have a force (like a spring or elastic band) which draws the linked nodes together. However, not all active links need to be displayed in the visualization. When a network is highly clustered (i.e., there is a high concentration of links) or multiple link types are used simultaneously, the network graph may become cluttered with links. Allowing the user to make links invisible lets her remove the links from the graph diagram while those links continue to affect the graph layout.

Navigating the network—The NVS provides several means for navigating the network including, but not limited to:

    -   Selecting the radius around a node—One way to navigate the network is to choose the “radius” around a selected node. In this context, radius is the number of links between the selected node and another node. For example, if the radius is set to three, all of the documents that can be reached by less than three links from the selected document will be displayed in the network graph.
    -   Expand—The network can be expanded to add additional nodes to the network graph. The user can, e.g., select one or more nodes (which represent documents or database records) and choose “Expand”, and all the nodes that can be reached via a single link from that/those nodes (but are not yet visible) are added to the network graph.
    -   Contract—The network can be contracted to remove nodes from the network graph. The user can, for example, select one or more nodes (which represent documents or database records) and choose “Contract”, and all the nodes that can be reached in a single link from that/those nodes and are not in any other way linked to the network are removed from the network graph.
    -   Hide—Nodes can be hidden from the network graph. By selecting one or more nodes and selecting “Hide”, the selected nodes are removed from the network graph.
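The radius and Expand operations can be sketched as a breadth-first traversal over an adjacency map. This is an illustrative sketch only: the function names are hypothetical, links are treated as undirected for navigation, and whether the radius bound is inclusive is an assumption (the sketch includes nodes at exactly the given number of links).

```python
from collections import deque

def build_adjacency(links):
    """Build an undirected adjacency map from (node, node) link pairs."""
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency

def nodes_within_radius(adjacency, start, radius):
    """Breadth-first search returning {node: number of links from start}, up to radius."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == radius:
            continue  # do not walk past the chosen radius
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance

def expand(adjacency, visible, selected):
    """Add every not-yet-visible node one link away from the selected nodes."""
    added = set()
    for node in selected:
        added |= adjacency.get(node, set()) - visible
    return visible | added

# Hypothetical chain of five documents.
adjacency = build_adjacency([("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")])
near_a = nodes_within_radius(adjacency, "A", 2)
after_expand = expand(adjacency, {"A", "B"}, {"B"})
```

Contract and Hide could be written in the same style by removing nodes from the visible set rather than adding to it.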

Filtering the network—Another useful feature of the NVS is the ability to filter the nodes that are represented within the network. This can be accomplished in several ways.

1) A filter can be applied by specifying the minimum, maximum or range of values of an attribute of the patent documents to be displayed. For instance, the documents can be filtered to represent only those records that meet a particular set of criteria such as:

    -   Dated before or after a particular date
    -   Only display nodes where a particular attribute appears more or less than some specified minimum number of times within the record set (e.g. only display document nodes for authors who have at least 5 documents in the dataset)

2) A filter can be applied by specifying values of node attributes to be displayed. For instance, the documents can be filtered to represent only those documents:

    -   Related to one or more companies
    -   Written by a particular set of one or more authors
    -   Classified within a particular set of one or more topics based on some fixed or dynamic taxonomy

3) A filter can be applied by providing the user with a list of attribute values and allowing the user to select or deselect attribute values to be displayed.

4) A filter can be applied by providing the user with a means to select one or more nodes from the network visualization using a computer pointing device (such as a mouse or trackball) and choosing a command from a menu that designates that the selected nodes should be filtered.

By using these filtering methods individually, or in combination, it is possible for a user to dynamically filter a dataset to display only the documents that are of interest. For example, a user could specify that she wanted to see only documents by Companies A, B, and C published between the years 1999 and 2005 that were classified with the categories A1, C3 and D5.
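The combined filter in this example can be sketched as follows. The record layout and the function name are hypothetical. The sketch also assumes that a document qualifies when it carries any one of the listed categories; the text does not say whether any or all are required.

```python
def filter_records(records, companies=None, year_range=None, categories=None):
    """Keep records that pass every filter supplied; a filter left as None is skipped."""
    kept = []
    for record in records:
        if companies is not None and record["company"] not in companies:
            continue
        if year_range is not None and not (year_range[0] <= record["year"] <= year_range[1]):
            continue
        # Assumption: matching any one of the requested categories is enough.
        if categories is not None and not (set(record["categories"]) & set(categories)):
            continue
        kept.append(record)
    return kept

# Hypothetical document records.
documents = [
    {"id": "D1", "company": "A", "year": 2001, "categories": ["A1"]},
    {"id": "D2", "company": "B", "year": 1998, "categories": ["C3"]},
    {"id": "D3", "company": "D", "year": 2003, "categories": ["D5"]},
    {"id": "D4", "company": "C", "year": 2005, "categories": ["B2"]},
    {"id": "D5", "company": "C", "year": 2004, "categories": ["D5", "B2"]},
]

selected = filter_records(
    documents,
    companies={"A", "B", "C"},
    year_range=(1999, 2005),
    categories={"A1", "C3", "D5"},
)
selected_ids = [record["id"] for record in selected]
```

Because each filter is optional, the same function covers the individual and combined uses described above.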

This capability is important for two reasons: 1) it allows the user to move back and forth between different subsets of documents within the dataset, and 2) it enables the user to refine her query to eliminate documents that are not of interest to her.

Clustering by attribute—Another method of revealing different kinds of relationships among documents in the network is to cluster them based on their attributes. One way to accomplish this is to place additional links between nodes that share a particular attribute value. For example, all patents with the same assignee can be linked to each other by additional links so that they will be attracted to each other and form a cluster. FIG. 10 shows a simple network of patent documents where the documents with the same assignee are clustered together. Alternatively, an additional node that represents the attribute value can be introduced to the graph, with each node having that value linked to the new node. This effectively draws all of these nodes into a cluster. Note that it is not necessary to display this new “attribute node” or the links between the attribute node and other nodes within the visualization.

Identifying natural clusters within the network—A network of linked documents will naturally have areas that are more highly clustered or tightly grouped than other areas in the network. “Degree of clustering” is a term of art within the field of social networking, and well known statistical methods exist for determining the degree of clustering within a part or the entirety of a network. These clusters can be identified using techniques developed as part of the Social Network Analysis field. Various techniques to identify clusters exist. It is not the purpose of the NVS to improve upon the known clustering techniques; however, the NVS makes use of various clustering techniques to identify related groups of patents in order to provide insight into the nature of a large set of documents.

Once the clusters have been identified, they can be labeled. One method of labeling is to identify words that are found in all or many of the titles and abstracts of the documents falling in a cluster. By stringing together the top several words (typically 1-5), a label can be created for each cluster. This label can provide a signal to the user about the content of each cluster. Since the listing of most frequently used words in the cluster is not likely to be the ideal cluster label, it is practical to provide the user with a means to change the cluster label to a more meaningful set of words.
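One simple way to realize this labeling step is to count, for each word, how many documents in the cluster contain it and then join the most common words. The sketch below is illustrative only: the stop-word list, the tie-breaking rule (alphabetical), and the function name are assumptions not found in the description.

```python
import re
from collections import Counter

# A small, hypothetical stop-word list; a real system would likely use a fuller one.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}

def cluster_label(texts, top_n=3):
    """Label a cluster with the words appearing in the most titles/abstracts."""
    document_frequency = Counter()
    for text in texts:
        # Count each word once per document so long abstracts do not dominate.
        words = set(re.findall(r"[a-z]+", text.lower())) - STOP_WORDS
        document_frequency.update(words)
    # Most documents first; ties are broken alphabetically for a stable label.
    ranked = sorted(document_frequency.items(), key=lambda item: (-item[1], item[0]))
    return " ".join(word for word, _ in ranked[:top_n])

# Hypothetical titles of documents in one cluster.
titles = [
    "Digital camera with image sensor",
    "Image sensor for a digital camera",
    "Camera lens and image stabilization",
]
label = cluster_label(titles, top_n=3)
```

A generated label of this kind is only a starting point, which is why the user is given a means to replace it.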

Chronological network graphs—Another way of revealing information about a set of documents is to display the network graph in such a way that the nodes are sorted by date. The date used could be any date associated with the document or database record. For example, patent documents have many dates associated with them: priority date, application date, publication date, grant date, expiration date, as well as other dates. The network graph can, for example, place all of the oldest documents on the left side of the graph and the newest documents on the right (or vice versa). A timeline can be placed alongside the graph to show the progression of technological development over time. Alternatively, the background of the network layout can be divided into segments by year, decade, or some other time division and labeled, with the documents falling into that range appearing in the appropriate segment. FIG. 11 shows one example of such a chronological network graph.

Other gradients—Networks can also be displayed along any number of gradients other than time. Any attribute of the nodes or meta-nodes that is quantitative (or can be made quantitative) can be used as a gradient on which to display the network visualization. One simple example of an alternative gradient is a network of customer data sorted by the annual spending of customers.

Transforming the Network Using “Meta-Nodes”

One of the central novel features of the embodiments described herein is the ability to transform the network representation. Prior art network visualization tools maintain a fixed, stable definition of what is a node and what is a link. For example, if data on a set of patent documents is introduced into one of the prior art network visualization tools, it is necessary to precisely define what is a node and what is a link. If, for example, you choose to have each patent represented as a node, and co-inventorship represent links, then the visualization tool will maintain that node/link definition without variation during the period of analysis. The NVS is fundamentally different in that it enables the user to transform the network by redefining what is a node and what is a link as she uses the data.

The Network Visualization System operates on the principle that any attribute of a database record can be represented as a node, a link, or both. As a simple example, if a conference organizer had a list of the various conference workshops and the attendees for each workshop, she could visualize them as a network of workshops linked by common attendees, but she could just as readily visualize them as a network of attendees linked by the workshops they attended together.
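The workshop example can be made concrete with a short sketch in which the same attendance table yields either network, depending on which column is treated as the node and which as the link. The data and the function name are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (workshop, attendee) attendance table.
attendance = [
    ("Workshop 1", "Ann"),
    ("Workshop 1", "Bob"),
    ("Workshop 2", "Bob"),
    ("Workshop 2", "Cy"),
    ("Workshop 3", "Ann"),
]

def project(pairs, node_index):
    """Treat the column at node_index as nodes and the other column as links.

    Returns {(node_a, node_b): set of shared values from the other column}.
    """
    link_index = 1 - node_index
    groups = defaultdict(set)
    for pair in pairs:
        groups[pair[link_index]].add(pair[node_index])
    links = defaultdict(set)
    for shared_value, members in groups.items():
        for a, b in combinations(sorted(members), 2):
            links[(a, b)].add(shared_value)
    return dict(links)

# Workshops as nodes, linked by common attendees.
workshop_network = project(attendance, node_index=0)
# Attendees as nodes, linked by the workshops they attended together.
attendee_network = project(attendance, node_index=1)
```

Both networks are derived from the same underlying records; only the node and link definitions differ.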

To take the point to its extreme, even a single database record can be viewed as a network with each attribute represented by a node and with the various attributes linked based on other common attributes. FIG. 12 shows a rather complex network that represents a single article from the PubMed database. The central node represents the article itself, while the various attributes of that article are represented by other nodes and connected to each other by links such as co-authorship links and other co-occurrence links, which exist because the linked attributes both occur on the selected article.

The Network Visualization System not only converts database information into network information, but it allows the user to create her own node and link definitions, combine any number of nodes and links on a single network, and change those definitions at will during the period of analysis. Under previously known methods, it is not possible to transform a network visualization in this way.

The ability to create alternative node definitions is a powerful tool for simplifying a network display and developing insight about the dataset. By redefining the nodes and links in the network display, the user can focus her attention on the entity of her interest. For example, a researcher analyzing patent data can focus on companies, industries or inventors rather than patents. These nodes represent higher-level entities than nodes that represent single documents or database records. We call these higher-level nodes “meta-nodes,” because they represent groups of documents or database records rather than single records. Links between these meta-nodes we call “meta-links”, because they represent an aggregation of the links between the collections of documents or database records represented by the meta-nodes. This ability to abstract the network to a “meta-level” enables the user to answer questions and inform decisions at a higher level than is possible using any other known visualization method.

The power of the network transformation method can be demonstrated with an example. Imagine a complex network of more than 1,000 patent documents in which the nodes are patent documents and the links are citation linkages. The network graph might look something like the picture in FIG. 13.

It is difficult to determine what is to be understood from this network graph. However, if you transform the network by redefining the node so that each node is a company, you end up with a network diagram like the one shown in FIG. 14.

This network diagram of patent documents, related to a particular photographic technology, makes it easy to identify the leading companies in the technology domain and understand the connections between them. Transforming the network diagram greatly simplifies it and can therefore enable greater insight.

Previously, we described an embodiment that converted database records into network data. That embodiment relied on a stable node definition, i.e., that each database record was a node. Another embodiment creates meta-node data and meta-link data from the database records. The example shown in FIG. 15 uses the same simple set of database records shown earlier in FIG. 7 to demonstrate how this is accomplished for creating Assignee meta-nodes and meta-links.

The first step in the process is to create a meta-node list, which is done by simply listing each unique value of a particular record attribute and noting the number of times that value appears in the dataset. One or more meta-link lists are then created for this attribute (in this example Assignee) based on the co-occurrence of values in the other attribute fields (e.g. Inventors, IPC classes and Citations). The method is identical to the method employed for creating a link list as described above, with two exceptions. First, the “record” in this instance is not an actual record from the database; it is a record from the meta-node list just created. Second, the meta-links have link-strength values denoting the number of co-occurrences (or citations) aggregated in that link.
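The two steps just described can be sketched, purely as an illustration, in Python. The records below are hypothetical and are not the records of FIG. 7; the field names, the function names, the choice to count shared values as the co-occurrence strength, and the exclusion of citations within a single assignee are all assumptions of this sketch.

```python
from collections import Counter
from itertools import combinations

# Hypothetical records; field names and values are illustrative only.
records = [
    {"id": "P1", "assignee": "Acme", "inventors": {"Lee", "Kim"}, "cites": set()},
    {"id": "P2", "assignee": "Acme", "inventors": {"Lee"}, "cites": {"P1"}},
    {"id": "P3", "assignee": "Bolt", "inventors": {"Kim"}, "cites": {"P1", "P2"}},
    {"id": "P4", "assignee": "Core", "inventors": {"Roy"}, "cites": {"P3"}},
]

def meta_node_list(records, attribute):
    """Each unique attribute value with the number of records carrying it."""
    return Counter(r[attribute] for r in records)

def co_occurrence_meta_links(records, node_attr, link_attr):
    """Meta-links between node_attr values that share link_attr values.

    Assumed strength: the number of distinct shared link_attr values.
    """
    values = {}
    for r in records:
        values.setdefault(r[node_attr], set()).update(r[link_attr])
    links = {}
    for a, b in combinations(sorted(values), 2):
        strength = len(values[a] & values[b])
        if strength:
            links[(a, b)] = strength
    return links

def citation_meta_links(records, node_attr):
    """Directed meta-links: (citing value, cited value) -> citation count.

    Citations between records of the same meta-node are skipped (assumption).
    """
    owner = {r["id"]: r[node_attr] for r in records}
    links = Counter()
    for r in records:
        for cited in r["cites"]:
            if cited in owner and owner[cited] != r[node_attr]:
                links[(r[node_attr], owner[cited])] += 1
    return links

assignee_nodes = meta_node_list(records, "assignee")
inventor_links = co_occurrence_meta_links(records, "assignee", "inventors")
citation_links = citation_meta_links(records, "assignee")
```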

The creation of link lists, meta-node lists and meta-link lists from database records makes it possible to view database information from literally any database as a network using the Network Visualization System described herein. As a practical matter, the described embodiment of the NVS does not actually convert every possible attribute into a link list, nor does it convert every attribute into a meta-node list or meta-link list. Rather, only the attributes that are most useful for the purpose of the user are converted into network data.

It will be obvious to one skilled in the art that there are alternative methods for choosing which attributes to convert, and at what step of analysis to make that choice. For example, it is sometimes desirable to define in advance as part of a computer program which attributes to convert into link and meta-node data for a particular database of interest. This allows a user access to a standard set of nodes, meta-nodes, links and meta-links to work with during her research with the tool. The network can be filtered, transformed, and viewed with multiple nodes, meta-nodes and links, but only within the bounds of the attributes established for the particular data set under analysis.

Alternatively, it is possible to give the user the ability to select attributes from the database records to convert into links, meta-nodes and meta-links during the period of her analysis. This can be accomplished by simply allowing the user to choose attributes (fields) from a list for conversion to network data. Once attributes are selected, links, meta-nodes and meta-links can be generated according to the method described above and added to the set of network visualization resources that are available to the user.

There are also other ways to create meta-node and meta-link information from database records. The following examples illustrate two alternative ways of creating such information, although other ways beyond these two also may be used.

EXAMPLE 1

Meta-nodes and meta-links can be created based on ranges of attribute values. For example, if there is an attribute of the database record that is numeric (e.g. in a customer database a field which records annual sales), then meta-nodes can be created based on ranges of values from within that attribute field (e.g. <$200 = Small spenders, $200-$1,000 = Moderate spenders, >$1,000 = Big spenders).
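A minimal sketch of such range-based meta-nodes is given below in Python. The customer values are hypothetical, and the function name `spending_meta_node` is an assumption of this sketch; the bin boundaries follow the example in the text.

```python
from collections import Counter

def spending_meta_node(annual_sales):
    """Assign a customer to a meta-node by annual sales, per the ranges above."""
    if annual_sales < 200:
        return "Small spenders"
    if annual_sales <= 1000:
        return "Moderate spenders"
    return "Big spenders"

# Hypothetical customer database: customer id -> annual sales in dollars.
customers = {"c1": 150, "c2": 200, "c3": 999.5, "c4": 1000, "c5": 2500}

# Meta-node list: each range label with the number of customers it represents.
meta_nodes = Counter(spending_meta_node(v) for v in customers.values())
```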

EXAMPLE 2

Meta-nodes and/or meta-links can be based on combinations of multiple record attributes. For example, a database of marketing survey results can be converted to network data where specific categories of customers can be grouped together based on their common answers to a group of questions. In this way, it is possible to define a meta-node for “Soccer Moms” based on an income of >$50,000/year, number of children >= 2, and car type = SUV or Minivan.
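One possible way to express such a combined-attribute meta-node is as a named rule over several fields, as in the following Python sketch. The survey field names (`income`, `children`, `car_type`), the sample rows, and the names `META_NODE_RULES` and `assign_meta_nodes` are hypothetical.

```python
# Hypothetical survey fields: income (dollars per year), children, car_type.
META_NODE_RULES = {
    "Soccer Moms": lambda r: (
        r["income"] > 50000
        and r["children"] >= 2
        and r["car_type"] in {"SUV", "Minivan"}
    ),
}

def assign_meta_nodes(records, rules):
    """Map each meta-node name to the ids of the records matching its rule."""
    return {
        name: [r["id"] for r in records if rule(r)]
        for name, rule in rules.items()
    }

survey = [
    {"id": 1, "income": 72000, "children": 3, "car_type": "Minivan"},
    {"id": 2, "income": 48000, "children": 2, "car_type": "SUV"},
    {"id": 3, "income": 90000, "children": 1, "car_type": "SUV"},
    {"id": 4, "income": 65000, "children": 2, "car_type": "Sedan"},
]

members = assign_meta_nodes(survey, META_NODE_RULES)
```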

The network diagram in FIG. 14 also illustrates two additional attributes of the Network Visualization System: meta-node sizing and meta-link aggregation.

Meta-node sizing: In the network graph in FIG. 14, each meta-node represents not a single patent, but all of the patents sharing a common value of the assignee attribute. In other words, each node represents all of the patents that were filed by the same assignee. In this diagram, the size of the node is based on the number of patents assigned to that particular assignee, and numbers are attached to the meta-nodes to display the value associated with the meta-node size.

Another feature of the NVS is to provide the user with the ability to size meta-nodes based on various attributes of the represented documents. For example, in a customer database with annual spending, the node might be sized based on the sum (or average) of annual spending for all customers represented by the meta-node. Meta-nodes can also be sized based on any number of network statistic calculations, such as the sum of centrality, eigenvector centrality, or betweenness centrality for the represented nodes. The ability to size nodes based on these various metrics enables the user to draw conclusions about things like node value, and other important measures of interest to the user.
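As an illustrative sketch only, sizing by count, sum, or average can be written in Python with a single aggregation parameter. The customer records, the grouping attribute `region`, and the function name `meta_node_sizes` are assumptions of this sketch.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical customer records grouped into meta-nodes by region.
customers = [
    {"name": "c1", "region": "East", "annual_spending": 100},
    {"name": "c2", "region": "East", "annual_spending": 300},
    {"name": "c3", "region": "West", "annual_spending": 50},
]

def meta_node_sizes(records, group_attr, value_attr, aggregate=sum):
    """Size each meta-node by aggregating value_attr over its records."""
    groups = defaultdict(list)
    for r in records:
        groups[r[group_attr]].append(r[value_attr])
    return {group: aggregate(values) for group, values in groups.items()}

by_sum = meta_node_sizes(customers, "region", "annual_spending")
by_mean = meta_node_sizes(customers, "region", "annual_spending", aggregate=mean)
by_count = meta_node_sizes(customers, "region", "annual_spending", aggregate=len)
```

Any other per-record statistic, such as a precomputed centrality score, could be passed through the same function by naming it as `value_attr`.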

Many of the possible attributes used for sizing meta-nodes can be applied to single document nodes as well as to meta-nodes. Of particular interest are the citation attributes (forward citations, backward citations, total citations) and the social network statistics on centrality (centrality, eigenvector centrality, betweenness centrality). Sizing nodes based on these and other statistics can provide a signal of value of the nodes within the network.

Meta-link aggregation: Another feature of the NVS is the ability to transform the network from one with binary (off/on) links to a network with meta-links (combined links with differing degrees of strength). This aggregation of links into meta-links also provides further insight to the user by revealing both the strength and nature of the relationship between the meta-nodes.

In the case of the example shown above, the links represent citations between the assignees. Multiple links are shown because citations can flow in either direction between the companies. The values of the links in this example are based on the total number of citations between one company's patents and another company's patents. This reveals who in this “innovation network” are the leaders, and who are the followers. Arrowheads are attached to the links to show the direction of the links, and numbers are attached to show the value associated with the link strength. Further, in a preferred embodiment, when a user points (using an electronic pointing device such as a mouse) at a particular node, the incoming and outgoing links are highlighted with different colors to provide a visual cue as to whether the selected company is a leader (highly cited) or a follower (citing others).
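The leader and follower distinction can be illustrated by totaling incoming and outgoing citation strength for each meta-node, as in the following Python sketch. The link values are hypothetical and the function name `citation_roles` is an assumption.

```python
from collections import Counter

def citation_roles(meta_links):
    """Total incoming and outgoing citation strength per meta-node.

    meta_links maps (citing company, cited company) to a citation count.
    """
    incoming = Counter()
    outgoing = Counter()
    for (citing, cited), count in meta_links.items():
        outgoing[citing] += count
        incoming[cited] += count
    nodes = sorted(set(incoming) | set(outgoing))
    return {n: {"incoming": incoming[n], "outgoing": outgoing[n]} for n in nodes}

# Hypothetical directed meta-links between assignees.
meta_links = {("Bolt", "Acme"): 2, ("Core", "Bolt"): 1, ("Acme", "Bolt"): 1}
roles = citation_roles(meta_links)
```

A node with high incoming strength relative to outgoing strength would correspond to a leader in the sense used above.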

As with node-sizing, link-strength can be based on a variety of different linkage attributes. Some examples include number of citations, number of unique documents cited, number of documents citing, average age of citation, age of most recent citation, as well as others. Also recall that meta-nodes can be connected by a wide variety of link types as described above. These links can also be agglomerated and the strength of their ties can be determined based on similar metrics to those described here.

Simultaneous Display of Multiple Node, Meta-Node and Link Types

The next extension of the meta-node concept is to simultaneously place multiple node and link types on the same graph. For example, in the patent context it is particularly revealing to see a graph that contains nodes representing both assignees and inventors. FIG. 16 shows an example of a network of assignees and inventors in which the links between assignee nodes and inventor nodes are based on whether the inventor has invented on a patent held by that assignee. In the network shown in FIG. 16, it is visually obvious which inventors work for which companies, and which have worked for multiple companies within the scope of the technology domain being examined.
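A graph of this kind can be sketched in Python by tagging each node with its type, so that assignee nodes and inventor nodes can share one link list. The records and function names below are hypothetical and serve only to illustrate the idea.

```python
from collections import defaultdict

# Hypothetical patent records.
records = [
    {"id": "P1", "assignee": "Acme", "inventors": {"Lee", "Kim"}},
    {"id": "P2", "assignee": "Bolt", "inventors": {"Kim"}},
    {"id": "P3", "assignee": "Core", "inventors": {"Roy"}},
]

def assignee_inventor_links(records):
    """Link each assignee node to every inventor named on its patents.

    Nodes are (type, value) pairs so that two node types can coexist.
    """
    links = set()
    for r in records:
        for inventor in r["inventors"]:
            links.add((("assignee", r["assignee"]), ("inventor", inventor)))
    return links

def multi_company_inventors(links):
    """Inventors linked to more than one assignee node."""
    companies = defaultdict(set)
    for (_, assignee), (_, inventor) in links:
        companies[inventor].add(assignee)
    return sorted(i for i, c in companies.items() if len(c) > 1)

links = assignee_inventor_links(records)
movers = multi_company_inventors(links)
```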

As another example, nodes representing patents and meta-nodes representing IPC (International Patent Classifications), USPC (United States Patent Classifications), and/or Derwent classes can be displayed on the same graph. FIG. 17 shows a graph with meta-nodes representing IPC codes and the patents grouped as members of particular IPCs. If a filter is set to view only patents from a particular assignee, this embodiment allows a user to visually determine what technologies that assignee is investing in over time, and how those priorities have changed.

Nodes and links representing different attributes or types of connections can be visually differentiated from each other in order to increase the usability of the system. Nodes can be differentiated by shape, color, border type, fill pattern, or by representing each with a particular icon such as a person for inventors and a picture of a document for patents. Links can be differentiated by shape, color, line type (e.g. solid, dotted), or other means.

The NVS allows the user to select which nodes and meta-nodes to display on a graph and also choose which linking attributes are used as the basis for linking. This provides a powerful tool for exploring a large set of patents and understanding their content and the relationships among them.

Fractal Networks

Another extension of the meta-node concept is the concept of fractal networks. Fractal networks are here defined as networks of meta-nodes that contain within each meta-node a network of other nodes or meta-nodes, as shown in FIG. 18. This fractal node representation can have as many layers as desired.

An example of the use of fractal nodes would be a network of meta-nodes that represent assignees where each node is sized by the number of patents it represents. Within each assignee meta-node, a network of meta-nodes can be displayed which represent IPC classes. This representation would show which IPC classifications are being developed by each assignee company in the patent set. Further, within each IPC meta-node, a network can be displayed that represents inventors. And within the inventor meta-nodes, a network can be displayed that represents patents.
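The nesting described in this example can be sketched as a recursive grouping, shown below in Python. For brevity the sketch assumes that each record carries a single value for each attribute, which real patent records generally do not; the records and the function name `fractal` are hypothetical, and links within each level are omitted.

```python
def fractal(records, levels):
    """Recursively group records by each attribute named in levels.

    Each meta-node records its size (number of records) and its children;
    at the innermost level the children are the record ids themselves.
    """
    if not levels:
        return [r["id"] for r in records]
    first, rest = levels[0], levels[1:]
    groups = {}
    for r in records:
        groups.setdefault(r[first], []).append(r)
    return {
        value: {"size": len(members), "children": fractal(members, rest)}
        for value, members in groups.items()
    }

# Hypothetical records with one value per attribute (a simplification).
records = [
    {"id": "P1", "assignee": "Acme", "ipc": "G06F", "inventor": "Lee"},
    {"id": "P2", "assignee": "Acme", "ipc": "G06F", "inventor": "Kim"},
    {"id": "P3", "assignee": "Acme", "ipc": "H04L", "inventor": "Lee"},
    {"id": "P4", "assignee": "Bolt", "ipc": "G06F", "inventor": "Roy"},
]

# Assignee meta-nodes containing IPC meta-nodes containing patents.
tree = fractal(records, ["assignee", "ipc"])
```

Changing the `levels` list changes which attribute is represented at each layer, which corresponds to letting the user select the attribute at each level.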

This network representation allows a user to ask and answer a wide variety of questions about the patent documents in a technology domain. It makes it possible to delve into the attributes of and relationships among the patents in a document set in a way that is otherwise impossible. An ideal implementation of fractal nodes provides the user with the ability to select the attribute and linking attribute represented by the network at each level of the fractal network graph.

Fractal nodes are also particularly useful in displaying hierarchical attributes such as various classification schemes including IPC and US patent classifications in the patent domain, Medical Subject Headings (MeSH) in the medical data domain as well as categorizations such as Vivisimo categories. Lower levels of the hierarchy can be represented within the nodes that represent higher levels of the hierarchy. This visual representation provides an intuitive way for users to understand the relative size of each category and sub-category as well as the relationships between and among them.

Up to this point, fractal networks have been described assuming that at each level of the hierarchy, only a single type of node or meta-node is displayed. Additional insights can be generated by providing a means for users to place multiple node and link types at each level of the hierarchy. Users can then more deeply explore how the various attributes are related to each other. As an example of the power of this capability, a user can display a network of assignee meta-nodes and within each meta-node, display meta-nodes representing both inventors and IPC classes. By so doing, the user can quickly understand what technology areas each company is working on and who the key inventors are within those areas of technology.

One challenge in using fractal networks is the fact that the “sub-networks” within each meta-node are likely to be very small within the network display. To address this problem, the system allows users to zoom in and out of the network in order to display the detail at any level of the fractal network that they choose. This is accomplished in one of two ways. First, the user can select a level of magnification from a toolbar button or menu selection. Second, the user can zoom in on the fractal network within a particular meta-node simply by selecting the meta-node from the network display. By selecting a meta-node (using a mouse or other electronic pointing device), the system can automatically center that meta-node and zoom in so that the next level of the fractal network can be clearly seen. To zoom back out again, the user can either select a new level of magnification from a toolbar button or menu selection, or they can click outside the meta-node to return to the previous level of magnification.

Implications of Filtering on Meta-Nodes and Meta-Links

As described above, various means are available to the user to filter the database records under examination. This filtering has an important implication for the use of meta-nodes and meta-links, namely, that the meta-node list, meta-node size, meta-link list and meta-link link-strength all are subject to change each time a filter is applied. Accordingly, each time a filter is applied to the data, the meta-node and meta-link information must be updated in order to maintain consistency between the set of records under examination and the values associated with the meta-nodes and meta-links in the network display.
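The effect of a filter on meta-node data can be illustrated with a short Python sketch in which the meta-node list is simply recomputed from the filtered records. The records, the `year` field, and the function name `meta_nodes_after_filter` are hypothetical; an actual implementation might update the data incrementally instead.

```python
from collections import Counter

# Hypothetical records with a filing year.
records = [
    {"id": "P1", "assignee": "Acme", "year": 2000},
    {"id": "P2", "assignee": "Acme", "year": 2003},
    {"id": "P3", "assignee": "Bolt", "year": 2001},
    {"id": "P4", "assignee": "Core", "year": 2004},
]

def meta_nodes_after_filter(records, attribute, keep):
    """Recompute the meta-node list from only the records passing the filter."""
    return Counter(r[attribute] for r in records if keep(r))

unfiltered = meta_nodes_after_filter(records, "assignee", lambda r: True)
filtered = meta_nodes_after_filter(records, "assignee", lambda r: r["year"] >= 2002)
```

In this example the filter both shrinks one meta-node and removes another from the list entirely.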

Providing Statistical Information about the Network

Another element of NVS that aids the user in making sense of large document sets is a clear presentation of statistical information about the set of documents under consideration. The user interface previously described allows the user to interact with the network, expanding and contracting it to create the network that represents the area of the user's interest. In the NVS, an interface is provided that updates the statistical information about the network dynamically as the network under consideration changes.

A variety of statistical information about the network is provided by the system including, but not limited to:

-   -   Document/record count
    -   Meta-node count (e.g. number of assignees in displayed network)
    -   Sums of node attribute values (e.g. total sales to all customers in the network)
    -   Document count by meta-node category (e.g. a list of articles per author)
    -   A graph of documents per year (e.g. articles per year by year of publication)
    -   Other network statistics

Other network statistics can also be provided including but not limited to statistics about the network (e.g. density, diameter, centralization, robustness, transitivity), measures about clusters within the network (e.g. cliques, ego networks, density), and metrics about the nodes (e.g. centrality (such as betweenness centrality and eigenvector centrality) and equivalence), and many others.
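A few of the network statistics named above can be computed with the standard definitions for an undirected graph, as in the following Python sketch. The small graph and the function names are hypothetical, the diameter computation assumes a connected graph, and degree centrality is used here as the simplest member of the centrality family.

```python
from collections import deque

def adjacency(nodes, edges):
    """Build an undirected adjacency map from a node list and edge pairs."""
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj

def density(adj):
    """Fraction of possible undirected links that are present."""
    n = len(adj)
    m = sum(len(neighbors) for neighbors in adj.values()) / 2
    return 0.0 if n < 2 else 2 * m / (n * (n - 1))

def degree_centrality(adj):
    """Degree of each node divided by the maximum possible degree."""
    n = len(adj)
    return {v: (len(nb) / (n - 1) if n > 1 else 0.0) for v, nb in adj.items()}

def eccentricity(adj, start):
    """Longest shortest-path distance from start, by breadth-first search."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return max(dist.values())

def diameter(adj):
    """Longest shortest path in the graph; assumes the graph is connected."""
    return max(eccentricity(adj, v) for v in adj)

# Hypothetical four-node chain A - B - C - D.
adj = adjacency(["A", "B", "C", "D"], [("A", "B"), ("B", "C"), ("C", "D")])
```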

The statistical information described above can be provided by user request from a menu or toolbar selection, or it can be provided in a separate window or pane within the interface. In a preferred embodiment, a separate pane is provided with tabs to allow the user to access the desired information about the current network. This pane, as shown in FIG. 19, can be expanded, contracted or closed at the user's discretion.

The interface further facilitates understanding by allowing the user to select one or more categories and highlighting the related nodes in the network visualization.

Additional information is also provided on a context sensitive basis as users use the system. Specifically, pop-up windows are available to provide additional information about the various nodes, meta-nodes, links and meta-links in the network graph. The information provided in each pop-up window is relevant to the object selected from the graph (note that more than one object of the same type (node, meta-node based on the same attribute, link of the same link type) can be selected at a time).

Resolving Ambiguous Attribute Values

One of the problems encountered in using the network transformation method as described above is the need to resolve ambiguous terms. Database managers or users will understand that the data contained in database records is often messy and inaccurate. Often, attribute values that represent the same value are not the same because of minor differences in the text. We have found that inventor names and assignee names (as well as other attributes) often appear in the patent database in different forms. For example, the assignee “IBM” may appear as “IBM”, “IBM, Inc.”, “International Business Machines”, “International Business Machines, Inc.” and in many other variant forms.

This creates a problem when using the meta-node and meta-link method described above, because these small differences in form cause the system to create multiple meta-nodes/meta-links when they should actually be combined. Therefore, the tool provides several means for the user to combine attribute values into a single value.

The system provides a means for users to resolve ambiguous attribute values by allowing them to group attribute values together under a single value. FIG. 19A shows the results of grouping multiple assignee names under a single assignee name that represents the group. Several means are provided to accomplish this. The first method is to allow users to select attribute values from a list and combine them under a new name or attribute value. For example, a user is presented with an alphabetical list of assignees and selects “IBM”, “IBM, Inc.”, “International Business Machines”, and “International Business Machines, Inc.” from the list. She can then group the selected items together using a toolbar button or a menu selection and either selects the system's suggestion for a group name (e.g. IBM) or types her own group name. The system will then combine all of these names under the new group name and display it as a single assignee for the purpose of all analysis.

A second method provides users with suggested collections of attribute values to be combined into groups. The system compares the similarity of the attribute values and suggests groups to be clustered together under a single attribute value. In addition to using the attribute under consideration (e.g. Assignee), the tool also examines other attribute values for clues that the attribute values should be combined. For example, if “IBM” and “IBM, Inc.” are both located in Armonk, N.Y., or they share common inventors, the tool will suggest that they should likely be combined. The user can review each suggested group and add or remove values from the list before choosing to accept the grouping.
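The text does not specify how similarity is measured. As one possible sketch, the following Python uses the standard library's `difflib.SequenceMatcher` on normalized names and requires corroboration from a second attribute (a shared city or a shared inventor). The similarity measure, the 0.5 threshold, the field names, and the function names are all assumptions of this sketch.

```python
from difflib import SequenceMatcher
from itertools import combinations

def normalize(name):
    """Lower-case the name and drop punctuation before comparison."""
    return "".join(ch for ch in name.lower() if ch.isalnum() or ch == " ").strip()

def suggest_pairs(entries, threshold=0.5):
    """Suggest pairs of attribute values that may denote the same entity.

    A pair is suggested when the names are textually similar and at least
    one other attribute (city or inventors) corroborates the match.
    """
    suggestions = []
    for a, b in combinations(sorted(entries), 2):
        ratio = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
        same_city = entries[a]["city"] == entries[b]["city"]
        shared_inventors = bool(entries[a]["inventors"] & entries[b]["inventors"])
        if ratio >= threshold and (same_city or shared_inventors):
            suggestions.append((a, b))
    return suggestions

# Hypothetical assignee entries with supporting attributes.
entries = {
    "IBM": {"city": "Armonk", "inventors": {"Lee"}},
    "IBM, Inc.": {"city": "Armonk", "inventors": {"Kim"}},
    "Acme Corp": {"city": "Boston", "inventors": {"Roy"}},
}

suggestions = suggest_pairs(entries)
```

Plain text similarity would not connect an abbreviation such as “IBM” with its spelled-out form, so a practical system would likely need additional evidence for such cases.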

The final method for resolving ambiguous attribute values uses the network diagram itself. The user can select meta-nodes directly on the network diagram (using an electronic pointing device such as a mouse). She can then combine multiple meta-nodes into a single group. This is accomplished by selecting multiple meta-nodes and choosing a tool button or menu selection to group the values. Alternatively, the user can “drag and drop” a meta-node onto another meta-node thereby suggesting to the system that they should be combined. The system will prompt the user to ensure that grouping those items is really the intention, and then it will combine the attribute values into a single group for purposes of analysis.

The system also makes it possible to un-group attribute groups once they have been created. The user can select a meta-node, or choose a group name from a list and review which attribute values have been combined. The user can then select specific attribute values for un-grouping and then select a toolbar button or a menu selection to ungroup the selected value or values.
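The grouping and un-grouping operations described above can be sketched as a mapping from raw attribute values to group names, as in the following Python. The class name `AttributeGroups` and its methods are hypothetical and are offered only to show how grouped values could be resolved before meta-nodes are counted.

```python
from collections import Counter

class AttributeGroups:
    """Maps raw attribute values to a user-chosen group name."""

    def __init__(self):
        self._group_of = {}

    def group(self, values, name):
        """Place every value in values under the group called name."""
        for value in values:
            self._group_of[value] = name

    def ungroup(self, values):
        """Remove the given values from whatever group they belong to."""
        for value in values:
            self._group_of.pop(value, None)

    def resolve(self, value):
        """Return the group name for value, or value itself if ungrouped."""
        return self._group_of.get(value, value)

    def members(self, name):
        """List the raw values currently combined under name."""
        return sorted(v for v, g in self._group_of.items() if g == name)

groups = AttributeGroups()
groups.group(["IBM", "IBM, Inc.", "International Business Machines"], "IBM")

# Hypothetical assignee values as they appear on four records.
raw_assignees = ["IBM", "IBM, Inc.", "International Business Machines", "Acme"]
meta_nodes = Counter(groups.resolve(a) for a in raw_assignees)
```

Because the raw values are never overwritten, un-grouping only requires removing entries from the mapping.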

In addition to the methods above, it is also possible to eliminate ambiguity within attribute values through comparison with external data sources. When considering assignees, for example, reference can be made to external lists of company name equivalents. These lists can also include subsidiaries and acquired companies which can then be suggested to the user as possible groups for combining. In the medical domain, doctor names can be resolved using the doctor's DEA number.

This process of combining multiple attribute values into a single attribute value can also be beneficial in another way. By combining values into groups, a hierarchy of values is created. This information can then be used according to the methods described above to display the relationships among data at different levels of the hierarchy. Specifically, attribute values at each level of the hierarchy can be represented as meta-nodes and can be displayed as part of the network display either as separate nodes within the graph, or as a hierarchical network using the fractal network method described above. This method is particularly valuable for hierarchical information like parent/subsidiary information about assignees.

Animating Networks

The tools we have described up to this point allow a user to transform her view of a network in a variety of ways. However, the descriptions up to this point have assumed that each network graph is a snapshot as of a particular point in time. In that way, the visualizations we have described so far have been static.

Another important element of the network visualization system is the ability to animate network graphs to reveal how they have changed over time. There are several different capabilities of the system which enable a user to examine the dynamics of network emergence.

The first method for revealing network dynamics is the ability to restrict the data displayed within the diagram based on a time period of interest. The user can select a minimum and a maximum date, which establish a date range for data to be displayed. The actual date used can be based on any date information associated with the underlying database records. With patent data, a variety of dates can be selected including but not limited to priority date, filing date, publication date, and grant date. Once the user has selected the date type and date range, the system then filters the data and displays the network graph based only on data meeting the specified parameters.
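A minimal Python sketch of this date filter follows. The records, the field names, and the function name `filter_by_date` are hypothetical, and treating both ends of the range as inclusive is an assumption.

```python
from datetime import date

# Hypothetical records carrying two of the date types named above.
records = [
    {"id": "P1", "filing_date": date(2000, 3, 1), "grant_date": date(2002, 6, 1)},
    {"id": "P2", "filing_date": date(2001, 7, 15), "grant_date": date(2003, 1, 10)},
    {"id": "P3", "filing_date": date(2003, 2, 20), "grant_date": date(2005, 9, 5)},
]

def filter_by_date(records, date_field, minimum, maximum):
    """Keep records whose chosen date falls within [minimum, maximum]."""
    return [r for r in records if minimum <= r[date_field] <= maximum]

by_filing = filter_by_date(records, "filing_date", date(2000, 1, 1), date(2001, 12, 31))
by_grant = filter_by_date(records, "grant_date", date(2000, 1, 1), date(2001, 12, 31))
```

The same range selects different records depending on which date type the user has chosen.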

The second method builds from this capability. The system provides the user with the ability to alter the date range in a very simple way. The user can select a “step amount” by which to change the date range (e.g. one month, one year) and then can click a single button to move the date range forward or backward by that increment. Additionally, separate toolbar buttons are provided so that the minimum date, the maximum date, or both dates can be adjusted in a single click. Once the user has clicked to alter the date range, the system quickly adjusts the data set to reflect the newly selected range and redraws the network graph. This allows the user to step through the time period of the data set at prescribed increments. Effectively, the system makes it possible to visualize how the network has emerged over time.

The third method is the creation of an actual animation of the network development. The system provides the user a method to enter an overall date range, an initial date range (which may be no range at all, e.g. if the minimum and maximum dates are set to the same value), which dates (min, max, or both) will be changed, and the size of the date increment to animate. The system uses these inputs to automatically step through the specified date range based on the increment provided and displays an animation of the emergence of the network over time.
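The stepping behavior driven by these inputs can be sketched as a generator of successive date windows, as shown below in Python. The function name `animation_windows`, its parameter names, the use of a fixed `timedelta` as the increment, and the rule of stopping once a window would leave the overall range are assumptions of this sketch.

```python
from datetime import date, timedelta

def animation_windows(overall_min, overall_max, initial_min, initial_max,
                      step, move_min=False, move_max=True):
    """Yield successive (minimum, maximum) date windows for an animation.

    Stops when a window would fall outside the overall range or when the
    minimum would pass the maximum.
    """
    low, high = initial_min, initial_max
    while low >= overall_min and high <= overall_max and low <= high:
        yield low, high
        if not (move_min or move_max):
            break  # nothing changes, so a single frame is all there is
        if move_min:
            low += step
        if move_max:
            high += step

# Hypothetical record dates; each frame counts the records in its window.
record_dates = [date(2000, 1, 1), date(2000, 1, 5), date(2000, 1, 20), date(2000, 1, 31)]

windows = list(animation_windows(
    overall_min=date(2000, 1, 1), overall_max=date(2000, 1, 31),
    initial_min=date(2000, 1, 1), initial_max=date(2000, 1, 1),
    step=timedelta(days=10),
))
frame_counts = [sum(low <= d <= high for d in record_dates) for low, high in windows]
```

Moving only the maximum date, as here, shows cumulative growth; moving both dates would show a sliding window instead.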

These animation methods are highly useful in revealing the development of the network; however, they create some challenges which must be overcome in order to make the system practical. First, when the network being displayed is large (in that it has many nodes and/or links, and/or a large number of underlying records) the high degree of computational complexity makes the animation slow and jerky on all but the most powerful computer systems. In order to overcome this limitation, a means is provided for the system to process the sequence of network diagrams as a batch and then save them as a series of snapshots or as a video clip. The snapshots or video clip can then be played back at a speed selected by the user without the system having to recalculate the underlying data at each point in the animation. This makes it possible for the user to review the animation repeatedly and also to pause, rewind, and fast forward the animation at will.

A second challenge related to animation of network graphs is the difficulty of assimilating what is happening within the graph. When a graph is animated, new nodes and links appear, meta-nodes grow and shrink, and the nodes change position within the graph as the attraction between the various nodes and meta-nodes changes over time. All of these simultaneous changes make it difficult for the user to understand what is happening as the animation unfolds. In order to make things simpler for the user, the system provides a means to reduce the number of parameters that are changing during the animation. Specifically, the system allows users to hold various parameters constant during the animation. Parameters that can be held constant during the animation include but are not limited to:

-   -   Constant node position: One of the most difficult parts of the animation to follow is the changing location of the nodes within the network display as the animation runs. Therefore, the system provides users with an option to maintain constant node positions during the animation. In order to accomplish this, the system first calculates the final position that each node will hold at the end of the animation, and then, as nodes appear and change size, and as links appear, grow and disappear, the location of the nodes is held constant at this position.
    -   All nodes present: Another option the system provides is the ability for the user to keep all nodes present on the graph during the entire animation. With this option selected, the system keeps each node visible during the entire animation, but provides a visual signal to differentiate nodes that do not represent data falling within the date range in effect at each point in the animation. The visual signal can be a difference in color (e.g. nodes that normally would not be visible are gray), size, shape, border, or other visible attribute. This allows the user to trace the path of each node continuously during the entire course of the animation.
    -   Other parameters that could be held constant during the animation include link presence, constant node size, and constant link size.
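The first two options in the list above can be sketched together in Python: positions are computed once for the complete set of nodes and reused in every frame, and nodes outside the current date range are kept but marked gray. The circular layout below is only a stand-in for whatever final layout the system would actually compute, and the function names and the color labels are assumptions.

```python
import math

def circular_layout(nodes):
    """Stand-in for the final layout: nodes evenly spaced on a unit circle."""
    ordered = sorted(nodes)
    count = len(ordered)
    return {
        node: (math.cos(2 * math.pi * i / count), math.sin(2 * math.pi * i / count))
        for i, node in enumerate(ordered)
    }

def render_frames(frames):
    """Give every node a fixed position and a per-frame color.

    frames is a list of sets, each holding the nodes active in that frame.
    """
    all_nodes = set().union(*frames)
    positions = circular_layout(all_nodes)  # computed once, then held constant
    rendered = []
    for active in frames:
        rendered.append({
            node: {
                "pos": positions[node],
                "color": "normal" if node in active else "gray",
            }
            for node in all_nodes
        })
    return rendered

# Hypothetical animation in which node B appears only in the second frame.
rendered = render_frames([{"A"}, {"A", "B"}])
```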

The ability to hold any combination of these parameters constant during animation gives the user great control over the animation that is displayed and increases their ability to assimilate how the network is emerging.

Another capability that the system provides is the ability to provide visual information about the rate of change associated with various parameters during the animation. Although the animation of network emergence provides tremendous visual information about how the network is developing, it is difficult to accurately compare and assess the changes as they occur. For instance, while it is easy to see that several technologies or company portfolios are growing, it is difficult or impossible for a user to assess which company portfolio is growing the fastest at any point in time.

For this reason, the system provides users with a means to visualize the rate of change of various parameters during the animation. Some parameters which users may have interest in during an animation are: node growth rates, link attachment rates, and the rate of change of various measures of centrality of the nodes in the network. The system provides users with the ability to track these rates of change during the animation, and to display information about these values in a table, graph or directly in the network diagram. The user can select which variables to track and which nodes or links to track them for (including all nodes and links if desired). This data can be displayed in a table or a graph (bar graph or line graph) adjacent to the network diagram and updated as the animation is displayed. Additionally (or alternatively), the data can be used to alter the appearance of the nodes or links in the network display as the animation is played. We have found it useful to alter the color of nodes or links based on rate-of-change data or centrality statistics in order to display which portions of the network diagram are “hottest” (as measured by the parameters described above). Alternatively, the data could be used to alter the size of nodes or links or some other visible characteristic as the network animation is played.
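Node growth rates of the kind described above can be sketched as the change in meta-node size between successive animation frames. In the following Python sketch the frame data and the function names are hypothetical, and growth is measured as a simple difference per step.

```python
def growth_rates(size_by_frame):
    """Change in each node's size between successive frames.

    size_by_frame is a list of {node: size} mappings, one per frame; a node
    missing from a frame is treated as having size 0.
    """
    rates = []
    for previous, current in zip(size_by_frame, size_by_frame[1:]):
        nodes = set(previous) | set(current)
        rates.append({n: current.get(n, 0) - previous.get(n, 0) for n in nodes})
    return rates

def fastest_growing(rates_frame):
    """Node with the largest growth in a frame (alphabetical on ties)."""
    return max(sorted(rates_frame), key=rates_frame.get)

# Hypothetical patent counts per assignee at three points in an animation.
sizes = [
    {"Acme": 1, "Bolt": 1},
    {"Acme": 2, "Bolt": 4},
    {"Acme": 5, "Bolt": 5, "Core": 1},
]

rates = growth_rates(sizes)
hottest = [fastest_growing(frame) for frame in rates]
```

The per-frame rates could then drive a table, a line graph, or the node colors in the display.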

One other useful capability is provided related to network animation. Often, users have an interest in a particular portion of the network diagram, and would like to have a deep understanding about how that portion of the network is emerging over time. For this reason, it is useful to provide a means whereby users can zoom in, and/or maintain focus on a specific node or nodes during the animation. For example, a company may be particularly interested in how its own patent portfolio has emerged over time. A means is provided so that users can select a specific node to zoom in on during the animation. It is useful to provide one or more “picture in picture” displays so that the user can observe the emergence of the overall network, as well as seeing how one or more “zoomed” portions of the network are emerging. This is particularly useful when the network being animated is a fractal network and the user is interested in observing how a “sub-network” is emerging within the larger network.

Network animation is a very powerful tool for revealing the patterns of emergence within a network. Within the context of patent data, animations reveal how technologies emerge, how companies' positions change, how inventors' careers and collaborations change over time, and many other features. This capability provides a significant contribution to the user's ability to make sense out of large data sets.

Linking to External Data

Up to this point, the network visualization system has been described based on the network analysis and insight that can be generated from within a single data source, in this case, a single set of database records. It should be evident based on the previous description that tremendous insight can be generated simply by using this endogenous data. However, additional insight can be generated when other exogenous data sources are used in conjunction with the patent data.

By linking to exogenous data sources, additional information can beobtained about the entities represented by the database attributes. Thechoice of which external data to link to and the value of linking tothat data depends on the context of the data source under review. Eachattribute of the database records creates the potential to link toadditional external data sources that can expand the availableinformation about the subject of interest. This additional data can beused to create entirely new meta-node classes, attach additionalattributes to one or more database or meta-node records, provideinformation about specific nodes or meta-nodes, and provide additionallinking information between nodes or meta-nodes. Specific examples ofthe kinds of exogenous data that is useful will be described within someof the preferred embodiments described later in this application.

Advantages of the Network Visualization System

The combination of these various tools and techniques provides a dramatic improvement in enabling a user to quickly make sense of the documents or records contained within a large dataset. First, the ability to rapidly identify and refine a collection of documents through dynamic filtering makes it possible for users to identify a set of documents related to their area of interest much more quickly, efficiently and precisely, and with less need for specific technical knowledge.

Second, users have the ability to explore a large set of documents or database records so that an understanding can quickly be developed of the nature of activity in the domain. This is made possible through techniques that provide summary information about the domain from a number of different perspectives. The combination of the network lens along with meta-nodes to represent the various attributes of the documents provides an intuitive way to understand not only the groupings inherent in the domain, but also the relationships among those groups.

Users can advantageously explore their domain of interest at any level of detail desired, and move back and forth between summary level information and detailed information seamlessly.

EXAMPLE 1 Method and Apparatus for Making Sense of the Patent Database

One or more embodiments of the invention relate to an improved method for making sense of patents in a patent database. One attribute of the patent database is that it is easy to establish the relatedness of documents to each other based on their citation relationship. The following discussion on using citations as a basis for establishing relatedness among patents applies equally to all databases that have citations, including but not limited to academic/scientific/medical literature as well as hyperlinks embedded in pages on the World Wide Web, which can also be thought of as a type of citation.

Various embodiments of the present invention can be used to help a business person, engineer, scientist, attorney, patent examiner or other interested party make sense out of a large set of patents. The challenge is to take a large set of patent documents and find ways to understand the technological developments they describe without having to read them. To accomplish this, a method is provided by which the user can visualize the various attributes of the documents and their relationships to each other.

Some of the questions that can be addressed using various embodiments of the invention include but are not limited to:

-   -   What technology(ies) is this group of patents about?
    -   How quickly are the various technologies developing?
    -   What are the hottest areas of technology in this domain?
    -   What are the most recent developments?
    -   Which companies are most active in developing these technologies?
    -   Which inventors are most active in developing these technologies?
    -   Which patents are most important?
    -   How important are the patent portfolios of the various companies participating in this technology domain?
    -   Which companies are leading the development and which are following?
    -   What other areas of technology are related to this technology?
    -   How much have these companies invested in this area of technology?
    -   How important are these patents to the companies that filed them?
    -   Which areas of technology have these companies abandoned and in which are they continuing to invest?
    -   What companies/inventors are collaborating in the development of this technology?
    -   Which inventors have changed companies?
    -   What areas of technology are being bridged for the first time and which companies own the patents that are bridging them?
    -   Which patents should I cite as prior art for my current patent application?
    -   Which patents could potentially be used to invalidate my patents or my competitors' patents?
    -   Which companies are most likely to be in violation of my patents?
    -   Which companies are most likely to be interested in licensing my patents?
    -   How quickly is academic research being translated into patentable technology?
    -   Which technologies in this domain are about to come off patent?
    -   How active has the company/inventor of these patents been in building on that technology and extending their patent protection?
    -   In which technologies are companies increasing their investment, and in which are they abandoning their investments?
    -   What technologies are being invested in within my industry?
    -   What industries are making use of this technology?

Visualization of Patent Networks

The Network Visualization System (NVS) described above can be readily applied to the patent database to great effect. The world's patent databases are particularly amenable to this kind of analysis because they contain citation information, which provides natural linking information between patents. The NVS is particularly valuable in the context of patents because it makes it possible for interested parties other than patent attorneys and R&D engineers to make use of patent data.

Patent data is available from a variety of sources including the world's various patent offices (USPTO, EPO, etc.) as well as from patent data providers like Thomson (including its subsidiaries and acquirees, Aureka, Micropatent, IHI and Delphion) and Lexis Nexis. This patent data is rich in information which can be transformed into network data. Some examples of the kinds of data that can be converted into nodes and links include but are not limited to the following:

-   -   Nodes/meta-nodes—patents, inventors, assignees, IPC classes, US        classes, Derwent classes, year of        priority/filing/publication/grant/expiration, semantic clusters,        status (application/granted/expired/abandoned), examiner,        inventor city/state/country, assignee city/state/country, filing        jurisdiction (US/EPO/WIPO etc.), priority number, and others as        well.    -   Links/meta-links—citations, co-citations, bibliographic        coupling, common IPC/US/Derwent class, common year of        priority/filing/publication/grant/expiration, common semantic        cluster, common status (application/granted/expired/abandoned),        common examiner, common inventor city/state/country, common        assignee city/state/country, common filing jurisdiction, common        priority number, and others as well.

Any combination of these node/meta-node and link/meta-link definitions, as well as ranges and combinations of the above, can be used within the NVS to examine sets of patent data.
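
As a non-limiting sketch, two of the link types listed above, co-citation and bibliographic coupling, might be derived from raw citation data as follows. The function names, the dictionary layout, and the sample patent identifiers are assumptions introduced for this illustration.

```python
from collections import Counter
from itertools import combinations


def bibliographic_coupling(cites):
    """Count shared references for each pair of citing patents.

    cites maps a patent id to the set of patent ids it cites. Two patents
    are bibliographically coupled when they cite at least one common patent.
    """
    counts = Counter()
    for a, b in combinations(sorted(cites), 2):
        shared = len(cites[a] & cites[b])
        if shared:
            counts[(a, b)] = shared
    return counts


def co_citation(cites):
    """Count, for each pair of cited patents, how many patents cite both."""
    counts = Counter()
    for references in cites.values():
        for pair in combinations(sorted(references), 2):
            counts[pair] += 1
    return counts


# Hypothetical sample data: P3 and P4 each cite P1 and P2; P5 cites only P2.
cites = {
    "P3": {"P1", "P2"},
    "P4": {"P1", "P2"},
    "P5": {"P2"},
}

coupling = bibliographic_coupling(cites)
cocited = co_citation(cites)
```

Each resulting pair count could serve as the strength of a link between the two patents, or be aggregated by assignee or inventor to form meta-links.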

Further, these nodes and links can be sized to provide additional information to the user as described above. Some particularly useful attributes that can be used for sizing nodes and links in the patent context include:

-   -   Node/meta-node sizing—In the context of patents, several specific metrics are relevant for node/meta-node sizing. For example, meta-nodes can be sized based on the number of patents, number of priority numbers (e.g. number of unique patent families), number of times the patents are cited (forward citations), number of patents cited by the represented patents (backward citations), total citations (forward plus backward), citations/year since publication/grant, patent years remaining (e.g. sum of the years remaining on the represented patents), average citations per patent, average patent age, average/total number of IPC/US/Derwent classes, number of inventors, and many other attribute metrics.
    -    As mentioned before, nodes and meta-nodes can also be sized based on any number of network statistic calculations, such as the sum of centrality/eigenvector centrality/betweenness centrality for the represented patents. The ability to size nodes based on these various metrics enables the user to draw conclusions about things like patent value, diversity of innovation, concentration of inventorship and other important measures of interest to users of patent data.
    -    Measures of patent value are of particular interest, and there are various attributes within the patent data (or other exogenous data that can be linked to) that can give a signal of patent value. Some specific examples include but are not limited to: documents cited, citing documents, number of academic citations, age of most recent citation, centrality/eigenvector centrality/betweenness centrality, length of the patent specification, number of claims, number of independent claims, length of the shortest independent claim, breadth of coverage (countries filed in), maintenance fee payment, post-grant opposition (in Europe), licensing, litigation of the patent, R&D dollars/patent by the assignee, average R&D dollars/patent in the industry, as well as others. Some or all of these measures can be aggregated together using a weighted average to provide a signal of value of the patents in the network. These values can be summed to provide an estimate of the value of a patent portfolio, and this value can be used to size patent nodes or meta-nodes which represent patent portfolios. This can provide the user with tremendous insight about which patents and portfolios are most important within a particular domain of interest.
    -   Link/meta-link sizing—Links and meta-links can also be sized based on various attributes in the same way as with node sizing. Link strength can be based on a variety of different linkage attributes. Some examples include number of citations, number of unique patents cited, number of patents citing, average age of citation, age of most recent citation, as well as others.
    -    Links between patents and patent portfolios can signal two important things: dependence and similarity. Dependence is a measure of how much one patent or patent portfolio relies on or is built off of another patent or patent portfolio. Measures that signal dependence provide important signals about the potential for infringement and are therefore critically important in patent analysis. Some metrics that signal dependence include times citing, times citing minus times cited, times cited by patents citing the same patents you cite, etc.
    -    Similarity is another important linking attribute in patent analysis. Similarity between two patent portfolios suggests a close parallel and perhaps redundancy between the R&D programs of two companies. Strategically, a high degree of similarity suggests the potential for a joint venture or some other sort of cost sharing. Measures that signal similarity between two patents include total inter-citations, structural equivalence (a network analysis term meaning that they hold the same structural position within the network), co-citation, bibliographic coupling, semantic similarity, etc.
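
The weighted-average aggregation of value signals and the summation into a portfolio value might, purely as an example, look like the following sketch. The signal names, the weights, and the assumption that every signal has already been normalized to the range 0 to 1 are choices made for this illustration only.

```python
def value_score(signals, weights):
    """Return the weighted average of the value signals for one patent.

    signals maps a signal name to a value assumed to be normalized to 0-1.
    Signals without a configured weight are ignored.
    """
    used = [name for name in signals if name in weights]
    total_weight = sum(weights[name] for name in used)
    if total_weight == 0:
        return 0.0
    return sum(signals[name] * weights[name] for name in used) / total_weight


def portfolio_value(patents, weights):
    """Sum the per-patent value scores to estimate a portfolio's value."""
    return sum(value_score(signals, weights) for signals in patents.values())


# Hypothetical weights and normalized signals for a two-patent portfolio.
weights = {"forward_citations": 2.0, "claims": 1.0, "countries_filed": 1.0}
portfolio = {
    "P1": {"forward_citations": 1.0, "claims": 0.5, "countries_filed": 0.5},
    "P2": {"forward_citations": 0.0, "claims": 1.0, "countries_filed": 0.0},
}

p1_score = value_score(portfolio["P1"], weights)   # (2.0 + 0.5 + 0.5) / 4
total = portfolio_value(portfolio, weights)
```

The per-patent score could size an individual patent node, while the summed value could size the meta-node that represents the portfolio.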

All of the features of the NVS described above are relevant to the analysis of patent data, including network transformation, use of multiple nodes and links, fractal networks, network animation, statistical information and linking to external data sources. Some elements of the preferred embodiment of the NVS that are specific to the analysis of patent data are described below.

Identifying Unassigned Patents

One of the unique challenges associated with using and linking patent data in the NVS is that patent applications typically do not have an assignee associated with them. This is unfortunate since it means that the newest patents in the database (the most cutting edge patents) cannot easily be identified by company. These patents also typically do not contain citations, and are not cited because they are brand new. One aspect of the implementation of the NVS for patent data is to resolve this problem by creating an alternative link that connects them properly to the network. This is accomplished by comparing attributes of the unassigned patents and patent applications to make a “best guess” about which company has filed the patent applications. Several attributes make this possible, including inventor names, inventor address, IPC/USPC classifications, cited patents, prosecuting law firm, semantic data and others.

By comparing these fields between the unassigned patents and other patents in the search results, it is possible to create a link that shows which other patent in the database is most likely to be by the same company. As an example, consider an unassigned patent that has three common inventors with the same addresses and is filed in the same IPC class and prosecuted by the same law firm as another patent in the database. It is highly likely that these patents were filed by the same assignee.

The system reviews all of the unassigned patents and creates a link between each patent and the most highly related other patent in the database. Each of the relatedness criteria can be given a score, and a weighted average can be used to determine the overall relatedness of the two documents. The user can then choose to “assign” patents with similarity above a selected threshold to the assignee of the highly related document. Alternatively, the user can review each linkage and choose to accept or reject the proposed “assignment.” These assignments are marked within the NVS as “computer assigned” so that the user can tell that there remains some uncertainty about whether those patents are in fact assigned to that particular assignee. The links created between these unassigned patents and the most highly related patent are a different class of links that can be turned on and off at the user's discretion. One particularly useful way to employ these links is to visualize a combined network diagram of assignees with a network of unassigned patents. This allows the user to review all of the “computer assigned” patents by company in a single network view.
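
One possible, simplified rendering of the matching procedure described above is sketched below. The use of Jaccard overlap as the per-attribute score, the field names, the weights, and the threshold value are assumptions made for this example; the description itself does not prescribe a particular scoring function.

```python
def relatedness(unassigned, candidate, weights):
    """Return a weighted average of per-attribute overlap scores.

    Each attribute is scored with Jaccard overlap (shared values divided by
    all values), which is an assumption of this sketch.
    """
    total = 0.0
    for field, weight in weights.items():
        a = set(unassigned.get(field, ()))
        b = set(candidate.get(field, ()))
        union = a | b
        overlap = len(a & b) / len(union) if union else 0.0
        total += weight * overlap
    return total / sum(weights.values())


def propose_assignee(unassigned, assigned_patents, weights, threshold):
    """Link an unassigned patent to its most related assigned patent.

    Returns None when no candidate reaches the threshold. Accepted proposals
    are flagged as computer assigned so the uncertainty stays visible.
    """
    best = max(assigned_patents,
               key=lambda p: relatedness(unassigned, p, weights),
               default=None)
    if best is None:
        return None
    score = relatedness(unassigned, best, weights)
    if score < threshold:
        return None
    return {"assignee": best["assignee"], "matched": best["id"],
            "score": score, "computer_assigned": True}


# Hypothetical sample data.
weights = {"inventors": 3.0, "ipc": 1.0, "law_firm": 1.0}
application = {"id": "A1", "inventors": ["Lee", "Kim", "Park"],
               "ipc": ["G06F"], "law_firm": ["Smith LLP"]}
assigned = [
    {"id": "P1", "assignee": "Acme", "inventors": ["Lee", "Kim", "Park"],
     "ipc": ["G06F"], "law_firm": ["Smith LLP"]},
    {"id": "P2", "assignee": "Globex", "inventors": ["Diaz"],
     "ipc": ["G06F"], "law_firm": ["Jones LLP"]},
]

proposal = propose_assignee(application, assigned, weights, threshold=0.6)
```

In this sample, the application shares all three inventors, the IPC class and the law firm with P1, so the proposal points to P1's assignee and carries the computer-assigned flag.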

Statistical Information

Various types of statistical information are relevant for the analysis of patent data, including:

-   -   Assignees—Number of patents in the selected network by assignee, sorted from highest to lowest.
    -   Inventors—Number of patents in the selected network by inventor, sorted from highest to lowest.
    -   Classification—Number of patents in the selected network by classification code, sorted from highest to lowest or sorted by classification category. Data can be provided for each of several classification schemes including IPC, USPC, Derwent classifications and others. Since many classification schemes are hierarchical, the data can be displayed using a tree structure with the number of patents within each category and subcategory displayed alongside each branch of the tree.
    -   Word usage—Number of patents containing key words, phrases or word groupings. Several tools exist which identify common word usage within document sets. These include Micropatent's Themescape product, Vivisimo's clustering tools, Grokker's clustering tool and others. These word clustering tools can easily be incorporated into the system to provide additional insight into the patent dataset under consideration.
    -   Citation—Various types of information about citations can be provided. These include but are not limited to the following:
        -   Most frequently cited patents, assignees, inventors, or other patent grouping.
        -   Highest number of citations per year since issuance for patents, assignees, inventors or other patent grouping.
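
For illustration, the per-attribute counts and the citations-per-year statistic listed above could be computed along the following lines. The record layout, field names, and sample values are hypothetical.

```python
from collections import Counter


def patents_by(records, field):
    """Count patents per value of a multi-valued field, highest first."""
    counts = Counter()
    for record in records:
        for value in record[field]:
            counts[value] += 1
    return counts.most_common()


def citations_per_year(record, current_year):
    """Return citations received per year since issuance.

    A patent issued in the current year is treated as one year old to avoid
    division by zero; that floor is an assumption of this sketch.
    """
    years = max(current_year - record["issued"], 1)
    return record["times_cited"] / years


# Hypothetical sample records.
records = [
    {"id": "P1", "assignees": ["Acme"], "issued": 2000, "times_cited": 20},
    {"id": "P2", "assignees": ["Acme"], "issued": 2008, "times_cited": 4},
    {"id": "P3", "assignees": ["Globex"], "issued": 2005, "times_cited": 10},
]

by_assignee = patents_by(records, "assignees")
p1_rate = citations_per_year(records[0], current_year=2010)
```

The same counting function applies to inventors or classification codes by passing a different field name.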

Statistical information is also provided in pop-up windows. Several examples of the kind of information provided in the pop-ups that is specifically related to patent data are described below:

-   -   Patent node pop-ups—When a patent node is selected, a pop-up window can be called up which displays information about the patent that is represented by the selected node. The information provided includes all of the basic information provided on the first page of a typical patent, including patent number, title, inventors, assignees, application number, priority/filing/publication/grant dates, IPC/USPC classes, field of search, citations (both patent and non-patent), examiner and agents, as well as other data from the patent such as number of pages, number of claims (independent and dependent), number of figures, number of words in the shortest independent claim, etc. In addition, many of the fields in the pop-up window are hyperlinked, allowing the user to pull up additional information. For example, the patent number is hyperlinked to the full text of the patent (or a .pdf), and citation links are hyperlinks (links to non-patent citations call up an Internet search for the cited document), among other hyperlinks. The pop-up also can include various statistical information about the patent (e.g. centrality) and other information from external sources including legal status, litigation status, licensing status, other patents in the patent family, post-grant oppositions, file wrapper information, etc.
    -   Assignee meta-nodes—When an assignee meta-node is selected, a pop-up window can be called up which displays a menu of different kinds of data that can be displayed about the assignee and the patents represented by the meta-node. Menu options include tables showing a list of the patents represented by the meta-node, patents by IPC class, patents by USPC class, patents by inventor, and a graph showing patents by year. Additional menu options include network statistical information that can be displayed about assignee meta-nodes, including total citations, average citations per year (since year of publication), and the sum of eigenvector centrality for the assignee's portfolio divided by the sum of eigenvector centrality for the entire network (a measure of portfolio value). Another menu option provides information about the assignee. This menu option links to basic company and financial information about the company. Various sources for this kind of information can be used including Hoovers <hoovers.com>, Bloomberg <bloomberg.com>, Yahoo Financial <finance.yahoo.com> and many other sources including both public and proprietary sites containing company profile information.
    -   Inventor meta-nodes—When an inventor meta-node is selected, a pop-up window can be called up which displays a menu of different kinds of data that can be displayed about the inventor and the patents represented by the meta-node. Menu options include tables showing a list of the patents represented by the meta-node, patents by co-inventor, patents by IPC/USPC class, and a graph showing patents by year. Another menu option provides information about the inventor. This menu option links to two different kinds of information: one is a basic web search for the inventor's name, and the second is “people finder” information from the World Wide Web. People finder sites such as <people.yahoo.com/>, <zabasearch.com>, <intelius.com>, <peoplefinders.com>, and many others can provide address histories, date of birth, marriage/divorce/death information, real estate records, liens and mortgages, bankruptcies, military service, relatives, neighbors, credit checks and background checks based on an individual's name, city and state, which can all be found directly in the patent information. This information can be useful in finding inventors if the need arises. It can also be used to identify inventor names within the database that are likely to represent the same person.
    -   IPC/USPC meta-nodes—When an IPC/USPC meta-node is selected, a pop-up window can be called up which displays a menu of different kinds of data that can be displayed about the IPC/USPC class and the patents represented by the meta-node. Menu options include tables showing a list of the patents represented by the meta-node, patents by assignee, patents by inventor, and a graph showing patents by year. Another menu option provides information about the class itself. This menu option provides detailed information about the IPC/USPC class including a full description of the class and its location in the IPC/USPC class hierarchy, concordance information showing how the selected IPC class is related to the USPC classes (or vice versa), and concordance information showing the link between the selected IPC/USPC class and the SIC/NAICS related industries of use and industries of manufacture. (The IPC/USPC to SIC/NAICS link is discussed in detail later in the description of this embodiment.)
    -   Meta-links—When a meta-link is selected, a pop-up window can be called up which displays information about the connections represented by the meta-link. A table can be displayed showing a list of patent to patent links represented, as well as a graph of the number of individual links represented by the meta-link over time. If, for example, the meta-link is a co-inventorship link, the meta-link pop-up will show a history of the collaboration between the two inventors. If the meta-link is an assignee-assignee citation link, the pop-up will show a history of citations between the two assignees.
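
The portfolio measure mentioned for assignee meta-nodes, namely the sum of eigenvector centrality for an assignee's patents divided by the sum for the whole network, could be approximated as in the sketch below. Treating citations as undirected links, adding a self-loop to each node so that the power iteration settles, the iteration count, and the sample data are all assumptions of this sketch rather than details taken from the description.

```python
def eigenvector_centrality(edges, iterations=200):
    """Approximate eigenvector centrality by power iteration.

    edges is an iterable of (a, b) pairs treated as undirected links. Each
    node also counts itself (equivalent to iterating on A + I), which keeps
    the iteration from oscillating on bipartite graphs. Scores sum to 1.
    """
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    score = {node: 1.0 for node in neighbors}
    for _ in range(iterations):
        updated = {node: score[node] + sum(score[m] for m in neighbors[node])
                   for node in neighbors}
        norm = sum(updated.values())
        score = {node: value / norm for node, value in updated.items()}
    return score


def portfolio_share(score, owner_of, assignee):
    """Return the assignee's centrality sum divided by the network sum."""
    total = sum(score.values())
    owned = sum(value for node, value in score.items()
                if owner_of.get(node) == assignee)
    return owned / total


# Hypothetical sample data: P2, P3 and P4 all cite P1.
edges = [("P2", "P1"), ("P3", "P1"), ("P4", "P1")]
owner_of = {"P1": "Acme", "P2": "Globex", "P3": "Globex", "P4": "Initech"}

score = eigenvector_centrality(edges)
acme_share = portfolio_share(score, owner_of, "Acme")
```

In this sample the heavily cited patent P1 receives the largest score, so its owner holds a share of network centrality that is larger than its share of the patent count.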

All of this pop-up information makes it possible for the user to explore the patent network at any level of detail desired, from high level meta-data down to the deepest level of detail about companies, inventors, technologies and patents. This makes the patent network visualization tool a powerful tool for understanding a large set of patent documents.

Linking to External Data Sources

Some examples of useful exogenous data sources specifically related to patent data, as well as their use within the patent network visualization system, are described below.

-   -   Industry data—One of the critical observations about making use        of patent data is that for most decision makers other than        patent attorneys and R&D engineers, the entity of interest is        not a patent. Rather, the user is typically interested in        knowing about a company, technology, inventor, or some other        entity. One particular entity of interest that many users would        like to understand better is industry information. Users often        want to know answers to questions like the following:        -   What technologies are critical to this industry?        -   What companies are leading the technology development in            this industry?        -   What industries make use of a particular technology?

Unfortunately, industry data is not directly attached to the patents in the patent database. However, it is possible to link the data in the patent database to industry data in two ways. First, the assignees/companies within the patent database can be linked to the industry or industries in which they participate. Governments around the world have made various attempts to develop standardized taxonomies of industries within their economies. The results are taxonomies like the SIC (Standard Industrial Classification) codes and NAICS (North American Industry Classification System) codes. These codes classify companies based on the industries in which they participate.

Various databases contain directories of companies with information about their SIC/NAICS industries. One example is the Worldwide Business Directory <siccode.com>, which houses a database of companies and their industries.

By linking assignees in the patent database to their SIC/NAICS codes, it is possible to create meta-nodes within the network visualization showing which industries are represented in the patent data under examination. Using the features described above, it is then possible to examine relationships between and among industries, and the relationships between industries, companies, technologies, inventors, countries, and other entities represented within the patent data.

Another means by which patent data can be linked to industry data is by way of the US Patent Classification (USPC) codes or the International Patent Classification (IPC) codes. This is made possible through various “technology-industry concordances”. During the period from 1990-1993, the Canadian Patent Office, in collaboration with Statistics Canada, assigned all new patent applications to both an SIC of Use and an SIC of Manufacture. This assignment was made for a total of about 148,000 patents. This information has been used by various government entities and academic researchers to draw a correspondence between IPC/USPC technology classifications and SIC/NAICS industry classifications.

Tables have been established, which are publicly available, that show the linkage between industries and technologies. Various versions of these tables can be obtained from sources on the World Wide Web including:

-   -   PC-US-SIC Concordance from UToronto    -    <rotman.utoronto.ca/.about.silverman/ipcsic/documentation_PC-S-IC_concordance.htm>    -   OECD Technology Concordance    -    <olis.oecd.org/olis/2002doc.nsf/linkto/dsti-doc(2002)5>    -   Yale Technology Concordance    -    <faculty1.coloradocollege.edu/.about.djohnson/jeps.html>    -   USPC to SIC Concordance here    -    <uspto.gov/web/offices/ac/ido/oeip/catalog/products/tafresh1.ht-m#U        SPC-SIC>

Using these tables, it is possible to link technology class information to industry classification. This makes it possible for users of the patent network visualization to analyze information about industries as part of their patent data analysis.

The two sources of industry data can also be used simultaneously. For instance, the system can create SIC or NAICS meta-nodes and display them within the same graph (or as fractal nodes) related to the assignees in the database. Simultaneously, the technology-industry concordance data can be used to create an IPC or USPC network linked to those industry meta-nodes. By so doing, it is possible to determine with a fair degree of confidence in which industry a company is employing a particular technology. This capability addresses a common question in patent data research, namely, “Of the various companies with patents related to this technology, which ones are employing the technology in my industry, and which are employing the technology in other industries?”
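
A minimal sketch of using the two industry links together is given below. The lookup tables, the placeholder industry and technology codes, and the rule of intersecting the two code sets are assumptions for illustration; real SIC/NAICS codes and concordance tables would be loaded from the sources named above.

```python
def industries_of_use(assignee, tech_class, assignee_industries, concordance):
    """Return industries supported by both links for one patent.

    assignee_industries maps a company to its industry codes, and concordance
    maps a technology class to the industry codes it corresponds to. The
    intersection suggests where the company is applying the technology.
    """
    company = set(assignee_industries.get(assignee, ()))
    technology = set(concordance.get(tech_class, ()))
    return sorted(company & technology)


# Placeholder codes, not real SIC/NAICS or IPC/USPC values.
assignee_industries = {"Acme": ["IND-A", "IND-B"], "Globex": ["IND-C"]}
concordance = {"TECH-1": ["IND-B", "IND-C"]}
patents = [
    {"id": "P1", "assignee": "Acme", "tech_class": "TECH-1"},
    {"id": "P2", "assignee": "Globex", "tech_class": "TECH-1"},
]

usage = {
    p["id"]: industries_of_use(p["assignee"], p["tech_class"],
                               assignee_industries, concordance)
    for p in patents
}
```

Here both patents fall in the same technology class, yet the combined data suggests that the two companies apply that technology in different industries.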

-   -   Legal status data—A second source of valuable exogenous data        related to patent data is the legal status databases such as        INPADOC. This database, along with others, contain the data        about maintenance fees, assignments, post-grant oppositions,        etc. Linkage to this data is technically simple as the patent        number or priority number can be directly linked to the        databases. The value of linking to this data is very high. The        patent network visualization system can use this data to        identify patents that have been abandoned (due to lack of        payment of maintenance fees), and reassigned. This provides a        strong signal about corporate priorities by showing where a        company's priorities are. This is accomplished by changing the        appearance of the patent nodes, assignee meta-nodes, IPC/USPC        meta-nodes, inventor meta-nodes or other nodes to show which        patents have been abandoned. The change in appearance could        include a change in color, shape, size, border style, or fill        style.    -    In addition, the legal status data can be used to show which        patents have changed hands, give evidence of an acquisition or        disposal of a business or business unit, and give a signal of        value of the patents. For example, a large number of patents        being reassigned to a new company would likely signal a change        of corporate structure. Another example is that patents that        have been opposed in a legal process are more likely to be        valuable patents as the parties would be unlikely to pursue an        opposition unless there was a significant economic incentive to        do so. Once again, this information can be used to change the        appearance of nodes or meta-nodes to signal important clues        about the patent data network to the user.    
-   File wrapper data—Another source of data that is valuable to        link to is the patent office file wrapper data. In the United        States, this data can be found online at <USPTO.gov>. This data        is technically easy to link to as it contains either the        priority date or patent number which can be directly linked to        the patent database. The use of this information is two-fold.        First, it is useful for the user to be able to “click through”        to the file wrapper data for patent they are particularly        interested in. Secondly, the file wrapper data provides clues        about the value and validity of patents. The number of office        actions, claim rejections, changes in numbers of claims from        application to granted patent, time to respond to office actions        and other information found in the file wrapper all provide        information that can be useful to the user of the patent network        visualization system. The appearance of nodes and meta-nodes can        be altered to signal to the user the presence or value of any of        these file wrapper parameters.    -   Legal data—Another important source of information about patents        is the legal status of related legal proceedings. Information        about the existence of and status of patent litigation is        critical to understanding the patent landscape. By utilizing        this information in coordination with the capabilities of the        patent network visualization system described above, important        questions can be answered such as:        -   What patents have been validated through a court proceeding?        -   What patents are currently being litigated?        -   What companies are accused of patent infringement, and how            many cases are they facing?        -   Who is aggressively asserting patents in my industry?        -   What technologies are the most actively entangled in patent            litigation?    
    -   Licensing data: Another important source of information about patents is the licensing data. Information about patent licensing provides a signal about the value of individual patents and patent portfolios. Various licensing databases exist including <yet2.com>, <royaltystat.com>, <royaltysource.com>, The IP Transaction Database <fvgi.com>, IP Research Associates Database <ipresearch.com>, and Licensing Royalty Rates <aspenpublishers.com>. Links can be made by patent number, by company, by industry (SIC/NAICS) or by other means. By utilizing this information in coordination with the capabilities of the patent network visualization system described above, important questions can be answered such as:
        -   What patents are actively available for licensing?
        -   What are the typical royalty rates associated with patents in this industry?
        -   Does this company license its patents?
        -   What technologies are my competitors licensing in/out?
    -   Corporate data: Corporate data is another source of exogenous data that can be incorporated into the patent data visualization system. Links can be made to company data by way of the assignee field in the patent database. Various types of corporate data exist from a myriad of sources. Examples of the type of data that is particularly useful to link to include financial data and product data.
    -    Various sources of corporate financial data exist, ranging from governmental systems like the SEC's EDGAR <sec.gov/edgar.shtml> to data aggregators like Hoovers <hoovers.com> and Bloomberg <bloomberg.com>. Financial information like sales, R&D spending, market cap and many others can be used to bring further insight to the patent data.
For example, annual R&D spending can be divided by the number of patent applications per year (with a time lag) to compare relative R&D efficiency. Sales divided by number of patents can be used as a signal of corporate investment in future revenue streams. Market cap divided by number of patents (or an estimate of portfolio value) can signal how expensive or inexpensive a patent portfolio might be to acquire. Comparison of these and many other measures can provide insight about the relative performance of companies and the importance, value and strength of their patent portfolios. These metrics can be incorporated into the patent network visualization system as attributes used to size related nodes and meta-nodes or otherwise alter the appearance of nodes or links in order to signal users of important information related to their research.
    -    Product information is another source of important company information that can provide further insight to users of the patent network visualization system. Many companies have online product catalogs. This information often contains technical information that can be linked to the patent data by searching the product database for key terms found in the patent specification. The system can use the assignee information along with keywords from the patent specification to create links to product data on the company product catalog. These external links can be displayed as nodes in the network diagram and can allow users to see whether or not, and how, companies are applying the technologies they have patented.
    -   Academic data: Another source of exogenous data that can inform a user's research on the patent database is academic data.
Academic data includes information from academic and industry journals, conference proceedings, research grants, as well as other sources. This information typically exists in the public domain before a patent is issued. It therefore serves as a marker of cutting edge research that is important to many users of patent data. Links can be established between patent data and academic data in a variety of ways. First, patents frequently cite academic or industry journals as prior art. Second, the inventors listed on patents often first publish their research in academic literature, so links can be made by connecting inventor names. Finally, academic literature can also be connected to patents by way of the institutions or companies that are assigned the patent and with whom a publication is affiliated.
    -    Information about the academic literature surrounding a body of patents can be used to understand the sources of fundamental research going on in the field, identify collaborations between industry and academia, identify potential break-through technologies before they emerge in the patent data, identify start-up companies that have emerged from academic environments, and many other insights.

By linking to academic data in the ways described above, it is possible to use all of the capabilities of the network visualization system to create combined networks of academic literature and patent data. The second embodiment of the network visualization described in detail below discusses the use of the network visualization system for visualizing academic literature. The combination of the two provides a quantum leap in the ability of users to understand the development of technology and the networks of innovation among people, companies, industries and geographies over what has come before.
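By way of illustration only, the three ways of linking patents to academic literature described above (prior-art citation, matching inventor and author names, and a shared institution) can be sketched as follows. The record fields (`inventors`, `assignee`, `cited_articles`, `authors`, `institution`) and the simple name normalization are assumptions made for this sketch and are not taken from any particular database schema.

```python
def normalize(name: str) -> str:
    """Lower-case and collapse whitespace so 'Smith  J' matches 'smith j'."""
    return " ".join(name.lower().split())


def link_patents_to_articles(patents, articles):
    """Return (patent_id, article_id, link_type) tuples for every link found."""
    links = []
    for p in patents:
        inventors = {normalize(n) for n in p["inventors"]}
        for a in articles:
            # 1. The patent cites the article as prior art.
            if a["id"] in p.get("cited_articles", []):
                links.append((p["id"], a["id"], "citation"))
            # 2. An inventor on the patent is also an author of the article.
            if inventors & {normalize(n) for n in a["authors"]}:
                links.append((p["id"], a["id"], "inventor-author"))
            # 3. The assignee is the institution affiliated with the article.
            if normalize(p["assignee"]) == normalize(a["institution"]):
                links.append((p["id"], a["id"], "institution"))
    return links


# Hypothetical records, for illustration only.
patents = [{"id": "P1", "inventors": ["Smith J"],
            "assignee": "Example University", "cited_articles": ["A2"]}]
articles = [
    {"id": "A1", "authors": ["smith j", "Lee K"], "institution": "Example University"},
    {"id": "A2", "authors": ["Chen L"], "institution": "Other Institute"},
]
links = link_patents_to_articles(patents, articles)
```

Exact string matching of names is the weakest part of this sketch; a practical system would likely need some form of name disambiguation before links of the second and third kind could be trusted.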

Implementation by a Patent Office or Commercial Patent Data Vendor

This embodiment of NVS has been found by the inventors to be incredibly useful in making sense of patent data. The USPTO, EPO and other patent offices as well as commercial patent data vendors such as Aureka, Micropatent, Delphion (all now owned by Thomson), Lexis Nexis, and others, have large databases of patent information. Other than some simple analysis tools as described above in the prior art section, none of these patent offices or commercial patent data vendors provides its customers with sophisticated tools for the analysis of patent data. The embodiment described here, either in full, or more likely a simple version with basic network visualization capabilities, would make a very powerful front-end for their customers to use in accessing their patent databases.

Thomson or Lexis Nexis could choose to employ a very simplified implementation of the NVS as a user interface for their patent databases. The simplified implementation might allow users to search the databases using Boolean searches and then return a list of documents as they do today. They could then use the NVS to convert those search results into network data and allow users to choose from a limited set of network visualizations of the search result. The system might provide users with the option to choose from one or more of the following options:

-   -   Citation network: a network of patents in the result set linked by citations
    -   Assignee network: a network of assignees in the result set linked by citations
    -   Inventor network: a network of inventors linked by co-inventorship links
    -   IPC or USPC network: a network of IPC or USPC classes linked by patents assigned to multiple classes
    -   Assignee/Inventor network: a network of assignees linked by citations, with inventors linked to the assignee nodes based on the company to which they assigned their inventions
    -   Assignee/IPC or USPC network: a network of assignees linked by citations, with IPC or USPC classes linked to the assignee nodes based on the number of patents by that company filed within the IPC or USPC class
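As a minimal sketch of how a search result set might be converted into two of the networks listed above, the following builds an inventor network from co-inventorship and an assignee network from citations. The record layout (`id`, `assignee`, `inventors`, `cites`) is hypothetical and chosen only for this example.

```python
from collections import Counter
from itertools import combinations


def inventor_network(patents):
    """Co-inventorship links: weight = number of patents two inventors share."""
    links = Counter()
    for p in patents:
        for a, b in combinations(sorted(set(p["inventors"])), 2):
            links[(a, b)] += 1
    return links


def assignee_citation_network(patents):
    """Directed assignee-to-assignee links: weight = number of citations."""
    assignee_of = {p["id"]: p["assignee"] for p in patents}
    links = Counter()
    for p in patents:
        for cited in p["cites"]:
            if cited in assignee_of:  # ignore citations outside the result set
                links[(p["assignee"], assignee_of[cited])] += 1
    return links


# Hypothetical result set, for illustration only.
results = [
    {"id": "P1", "assignee": "Alpha", "inventors": ["Ann", "Bob"], "cites": []},
    {"id": "P2", "assignee": "Beta", "inventors": ["Bob", "Cy", "Ann"], "cites": ["P1"]},
    {"id": "P3", "assignee": "Beta", "inventors": ["Cy"], "cites": ["P1", "P9"]},
]
inventor_links = inventor_network(results)
assignee_links = assignee_citation_network(results)
```

Both functions return weighted link lists keyed by node pair, so the link weight is available for sizing the drawn links.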

Additional features that they would likely want to include would be the ability to filter the result set to limit the records in the visualization. Filtering options should include the ability to filter out records from the visualization with particular date ranges, assignees, IPC or USPC classes, or inventors.
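The filtering described above might be sketched as follows. This is an illustrative assumption about how such a filter could behave: a record is removed if it matches any of the supplied criteria, and the field names are hypothetical.

```python
from datetime import date


def filter_out(records, date_range=None, assignees=(), classes=(), inventors=()):
    """Return the records that match none of the exclusion criteria."""
    kept = []
    for r in records:
        if date_range and date_range[0] <= r["date"] <= date_range[1]:
            continue  # falls inside the excluded date range
        if r["assignee"] in assignees:
            continue
        if set(r["classes"]) & set(classes):
            continue  # shares at least one excluded class
        if set(r["inventors"]) & set(inventors):
            continue
        kept.append(r)
    return kept


# Hypothetical records, for illustration only.
records = [
    {"id": "P1", "date": date(2001, 5, 1), "assignee": "Alpha",
     "classes": ["G06F"], "inventors": ["Ann"]},
    {"id": "P2", "date": date(2003, 2, 1), "assignee": "Beta",
     "classes": ["A61K"], "inventors": ["Bob"]},
    {"id": "P3", "date": date(2004, 7, 1), "assignee": "Alpha",
     "classes": ["A61K", "G06F"], "inventors": ["Cy"]},
]
kept = filter_out(
    records,
    date_range=(date(2003, 1, 1), date(2003, 12, 31)),
    classes=["A61K"],
)
```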

While this implementation of the NVS would lack many of the features described in this embodiment, it would be a quantum leap for their users in terms of their ability to understand the results of their patent searches. It would allow them to examine their search results from many different perspectives, refine their searches through the NVS filtering capabilities, and ultimately examine patent lists or patent documents directly through the NVS system.

It is also noteworthy that the two major commercial patent data vendors are part of large organizations that house many other sources of data. The NVS as described in both this embodiment as well as the more general description above could provide a front-end for all of their various data types. Also, some of the exogenous data sources described in this embodiment are actually owned or offered through a licensing arrangement by the two major patent data companies, Thomson and Lexis Nexis. The implementation of the NVS as a front end for access to their databases as well as a way to link disparate sources of data within their systems would allow these companies to provide a highly differentiated value proposition to their customers. These and other data vendors need to find alternative ways for users of their databases to extract more value from their data in order to grow and support higher prices and profit margins. The NVS could make a significant contribution toward those objectives.

EXAMPLE 2 Method and Apparatus for Searching for and Analyzing Documents in a Medical Publication Database

In a second example, one or more embodiments of the invention are applied to searching for and analyzing documents in a database of academic literature. One example of such a database is the medical publication database known as the PubMed database. Application to other academic databases is also possible.

The PubMed database is a large database of medical research papers appearing in nearly 200 medical journals. It is a rich repository of information about the research domain in the world of medicine. The PubMed database is most frequently used by doctors or other medical professionals who are looking for information about a specific disease, treatment or other subject of medical interest. Their research invariably begins with a Boolean search for a keyword, author or journal, after which they are presented with a list of papers matching their selected criteria. The researcher's next step is to scan through the search result list and read titles and abstracts until she finds a paper that interests her. She then proceeds to read some or all of the article. She may then return to her search result and continue to scan and read results until she has found the information she is looking for.

This method is useful for medical professionals and is perfectly appropriate if the researcher's objective is to find information that is likely to be WITHIN one or a few papers. However, there is a class of questions that cannot readily be answered in this way. We call these meta-questions. They are questions, not about what is contained within the articles, but about the meta-entities which those articles represent. Rather than asking questions about articles and what is contained in them, researchers are often interested in questions such as the following:

-   -   What is happening in the field of gene therapy?
    -   Who are the leading researchers on Alzheimer therapies?
    -   What institutions are collaborating in researching osteoarthritis?
    -   How are the various domains of cancer research related to each other?
    -   Who are the most influential researchers in nanobiotechnology?
    -   What research is going on in specific fields of medical science such as disease groups (e.g., Alzheimer disease), therapies (e.g., immunotherapy), or specific mechanisms of action (e.g., blocking plaque formation)? How quickly is the work progressing? How has this changed over time?
    -   What is the pattern of collaboration in a given field? Who is involved? How are they working together? Where is there a tightly knit research community? Where is it fragmented? Where are the best opportunities to bridge clusters of research in order, for example, to translate scientific discovery into practice in MRSA? Has the pattern of collaboration improved or diminished over time?
    -   What is the intellectual structure of a given field, i.e., which topics tend to be researched together? Which not? Is this for reasons of science or lapses in institutional and social connection? What are the frequently repeated strong topic connections? Which connections are emerging for the future?
    -   How does the way that a domain is researched (i.e., the intellectual structure as it evolves over time) affect the collaboration patterns that emerge, and vice versa?
    -   How do people within specific companies or universities collaborate on specific topics? Among themselves? With others beyond the institution?
    -   Who are the most influential medical scientists in a given therapeutic area? Who is most central to a network? What set of medical scientists best collectively span a network with their individual patterns of influence?

These questions, and many others, can be answered using the NVS as described above. Academic data and specifically PubMed data can also be analyzed and visualized in the same way as patent data. The attributes of patent data that make it analyzable by the methods described here are directly analogous to the data found in academic databases including PubMed. FIG. 20 shows the relationships between attributes of patent data and attributes of academic literature.

As can be clearly seen, there is a direct correspondence between the two data sources. This makes it possible to analyze the PubMed data (or any database of academic literature) using the same methods as described in the previous embodiment. However, not every attribute of patent and academic data is directly analogous. Therefore, it is necessary to slightly modify the methods described in the patent network visualization system for specific use with academic literature.

Just as with the patent visualization system, the medical network visualization system allows the user to create and examine database records from the academic database as a network of nodes and links. As with the patent database, the academic data is not initially structured as network information. In other words, it does not contain a node list and a link list. Before it can be visualized as a network, academic data must first be restructured in order to convert it into network information. This is accomplished in exactly the same way as described in the previous embodiment.
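The restructuring procedure of the previous embodiment is not reproduced here. Purely as an illustrative sketch of what a node list and a link list might look like for article records, the following derives article nodes and citation links from a set of records whose field names (`pmid`, `title`, `cites`) are assumed for this example.

```python
def to_network(records):
    """Build a node list and a link list from flat article records.

    Only citations that point to another article in the same record set
    become links, so every link connects two nodes that exist.
    """
    nodes = [{"id": r["pmid"], "label": r["title"]} for r in records]
    ids = {r["pmid"] for r in records}
    links = [
        {"source": r["pmid"], "target": cited}
        for r in records
        for cited in r.get("cites", [])
        if cited in ids
    ]
    return nodes, links


# Hypothetical records, for illustration only.
records = [
    {"pmid": 101, "title": "Article A", "cites": []},
    {"pmid": 102, "title": "Article B", "cites": [101, 999]},
    {"pmid": 103, "title": "Article C", "cites": [101, 102]},
]
nodes, links = to_network(records)
```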

Once the data is structured, it is possible for the user to view various network visualizations based on database records from the PubMed database or other academic database. Unlike prior art systems, the medical network visualization system does not require a stable definition of the nodes and links of the network. Rather, the researcher can change the definition of both nodes and links dynamically according to her interest. This ability to transform the network from one node/link definition to another, and the ability to simultaneously view multiple connected network views of the same data, makes it possible for the user to quickly and easily make sense of a large set of database records, and answer meta-level questions that cannot be answered through any other means.

As with patent data, the network of medical data can be thought of in a variety of ways with different definitions of what is a node and what is a link. Some examples of the various node/meta-nodes and link/meta-links that can be created from the PubMed data are described below.

-   -   Nodes/meta-nodes: Node/meta-node definitions in the medical data context include but are not limited to articles, papers, grants, reviews, authors, journals, year of publication, reviewer, author city/state/country, institution city/state/country, journal country, MeSH categories, and others as well.
    -   Links/meta-links: citations, co-citations, bibliographic coupling, common MeSH class, common year of publication, common semantic cluster, common reviewer, common author city/state/country, common institution city/state/country, common journal country, and others as well.

Any combination of these node/meta-node and link/meta-link definitions as well as ranges and combinations of the above can be used within the NVS to examine sets of PubMed data.
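One way to picture how a node definition and a link definition can be chosen independently is the sketch below, in which meta-nodes are the distinct values of one attribute and two meta-nodes are linked once for each value of a second attribute they have in common. The function and field names are hypothetical; this is not presented as the NVS implementation.

```python
from collections import Counter
from itertools import combinations


def values(record, attr):
    """Return an attribute as a list, whether it is stored as a scalar or a list."""
    v = record[attr]
    return list(v) if isinstance(v, (list, tuple, set)) else [v]


def meta_network(records, node_attr, link_attr):
    """Nodes are values of node_attr; links count shared values of link_attr."""
    groups = {}
    for r in records:
        for link_value in values(r, link_attr):
            groups.setdefault(link_value, set()).update(values(r, node_attr))
    nodes = sorted({n for r in records for n in values(r, node_attr)})
    links = Counter()
    for members in groups.values():
        for a, b in combinations(sorted(members), 2):
            links[(a, b)] += 1
    return nodes, links


# Hypothetical records, for illustration only.
records = [
    {"pmid": 1, "authors": ["Ann", "Bob"], "journal": "J1", "mesh": ["Insulin"]},
    {"pmid": 2, "authors": ["Cy"], "journal": "J2", "mesh": ["Insulin", "Obesity"]},
]
coauthor_nodes, coauthor_links = meta_network(records, "authors", "pmid")
topic_nodes, topic_links = meta_network(records, "authors", "mesh")
journal_nodes, journal_links = meta_network(records, "journal", "mesh")
```

The same records yield a co-authorship network, an author network linked by common MeSH category, and a journal network linked by common MeSH category, simply by changing the two attribute names.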

Further, these nodes and links can be sized to provide additional information to the user as described above. Some particularly useful attributes that can be used for sizing nodes and links in the medical data context include:

-   -   Node/meta-node sizing: In the context of medical data, several specific metrics are relevant for node/meta-node sizing. For example, meta-nodes can be sized based on the number of articles, number of times the articles are cited (forward citations), number of articles cited by the represented articles (backward citations), total citations (forward plus backward), citations/year since publication, average citations per article, average/total number of MeSH categories, number of authors, and many other attribute metrics.
    -    As mentioned before, nodes and meta-nodes can also be sized based on any number of network statistic calculations like the sum of centrality/eigenvector centrality/betweenness centrality for the represented articles. In the context of medical research, these metrics signal how important the research is based on peer citations. The ability to spot important research is critically important for biotech, pharmaceutical and other life-sciences companies as they try to stay on the cutting edge of research and access leading research that will help them introduce the next blockbuster drug or highly profitable medical device. The ability to size nodes and meta-nodes in the NVS makes it easy to spot important research within a large and complex area of medical science.
    -   Link/meta-link sizing: Some examples of attributes for link/meta-link sizing that are relevant in the medical domain include number of citations, number of unique articles cited, number of articles citing, average age of citation, age of most recent citation, as well as others.
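As a worked illustration of several of the attribute-based sizing metrics listed above, the following sketch computes them for the articles represented by a single meta-node and scales one of them to a drawing radius. The field names, the one-year minimum age, and the radius range are assumptions made for the example.

```python
def metanode_metrics(articles, current_year):
    """Candidate sizing metrics for one meta-node (e.g. an author or journal)."""
    n = len(articles)
    forward = sum(a["forward_citations"] for a in articles)
    backward = sum(a["backward_citations"] for a in articles)
    # Treat articles published in the current year as one year old
    # so that the division below never divides by zero.
    per_year = sum(
        a["forward_citations"] / max(current_year - a["year"], 1) for a in articles
    )
    return {
        "articles": n,
        "forward": forward,
        "backward": backward,
        "total": forward + backward,
        "citations_per_year": per_year,
        "average_per_article": forward / n if n else 0.0,
    }


def radius(value, max_value, min_radius=4.0, max_radius=30.0):
    """Linearly scale a metric into a drawing radius (the range is arbitrary)."""
    if max_value <= 0:
        return min_radius
    return min_radius + (max_radius - min_radius) * (value / max_value)


# Hypothetical articles, for illustration only.
articles = [
    {"year": 2000, "forward_citations": 10, "backward_citations": 4},
    {"year": 2003, "forward_citations": 6, "backward_citations": 2},
]
metrics = metanode_metrics(articles, current_year=2005)
```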

Examples of the Medical Network Visualization System

Shown below are various exemplary screen shots illustrating various network diagrams generated using the Medical Network Visualization System (MNVS) for performing searching and analysis of particular medical documents. These maps, although based on a limited set of node and link combinations, reveal the capabilities available to a user of the network visualization system. For these simple examples, both nodes and meta-nodes maintain constant sizing, and links and meta-links maintain constant widths. However, as with the Patent Network Visualization System, these parameters can also be altered to provide even more insight to the user.

FIG. 21 shows a typical network based on PubMed data and visualized using the MNVS described herein. It is the result of a search for documents written from January 2000 to the present with the medical subject heading (MeSH) “Diabetes Mellitus Type I” and the location “Boston.” The tool retrieved 124 documents in order to create the map of the network. One can interpret the links in the map as per the key above.

The network diagram displays three different kinds of meta-nodes (author meta-nodes, journal meta-nodes, and MeSH meta-nodes). In a preferred embodiment of the medical network visualization system, the node types are differentiated by different colors (authors: black on yellow; journals: white on green; MeSH: blue on white). Although these are difficult to see in a black and white representation, they appear as follows (authors: dark on light; journals: white on dark; MeSH: dark on white).

The MNVS provides the user the capability to choose node and link definitions as she works. FIG. 22 demonstrates this capability as it shows a detail of the same network as in FIG. 21, but displays only the Author-Author links, which reveal the social network of the scientific community. From these kinds of network diagrams, it is possible to learn who the leading researchers are within a particular field of study, with whom they collaborate and which scientists are most influential.

One unique element of medical journal databases is the significance of the order of author names on an article. Based on interviews and our experience with this kind of analysis, we have learned that the first author on a medical journal article is the Principal Investigator (PI) on the research. If a second PI is involved in the research (as is often the case), her name will appear second. The last name in the author list is the head of the laboratory in which the research was conducted. This “lab head” probably had little involvement in the actual research project, but is likely to be an important person in the field. The names falling between the second and last name in the author list are typically laboratory assistants and other minor contributors to the piece.

For this reason, the MNVS allows users to select which author names to include in the network. One useful setting we have discovered is to include first, second and last author names in the network and exclude all others.
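The setting just described might be sketched as a small selection function. The function name is hypothetical, and the treatment of articles with three or fewer authors (all names are kept, since each is already a first, second or last author) is an assumption of the sketch.

```python
def key_authors(authors):
    """Keep the first, second and last names from an ordered author list."""
    if len(authors) <= 3:
        # Every name is already a first, second or last author.
        return list(authors)
    return [authors[0], authors[1], authors[-1]]


selected = key_authors(["Ann", "Bob", "Cy", "Dee", "Eve"])
```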

FIGS. 23 and 24 further demonstrate the capabilities of the MNVS. Here the same network is transformed into a network of authors and journals (FIG. 23) and authors and MeSH categories (FIG. 24). These networks enable a user to quickly understand the areas of research interest of researchers within the network.

Finally, FIG. 25 shows the same network once again; this time, however, it displays the network of links between medical topic areas as designated by the MeSH categories. Using the MNVS in this way, medical professionals are often surprised to find an unexpected nexus between two medical fields that on the surface appear unrelated. Observations about unexpected connections between medical subjects can lead to new ways to think about medical problems and suggest new paths for research, as they offer the potential to apply findings in one field of study to challenges in another area of research. The medical profession tends to be siloed by professional specialty: specialists in Field A rarely mix with specialists in Field B because they do not attend the same conferences, participate in the same residency programs, read the same journals or otherwise interact. There is tremendous value in putting together the right people from the different specialties, because entirely new paths of inquiry are often suggested. The medical network visualization system makes it possible to instantaneously observe unexpected areas of connection from which new medical insight may emerge.

Applications of Network Visualization of Medical/Academic Data

The capabilities enabled by the medical network visualization system give researchers the ability to analyze and develop deep insights into large sets of medical database records. These insights come in various forms and therefore, the network visualization system can be used in various contexts to analyze topics such as:

-   -   Organization of collaboration (in general)
    -   Organization of collaboration within a company
    -   Who to target as Key Opinion Leaders (KOLs) or key researchers in a geography for clinical trials or market influence
    -   Topic clustering in a particular field (which MeSH categories go together)
    -   Research synergies or substitutions across organizations
    -   Regional bases of research strength in a broader geography

Organization of Collaboration in General

At a basic level, the network in FIG. 26 shows different ‘clusters’ of collaboration. A user can easily identify groups of authors who have published together in various journals. Another feature of the system is the ability to color author meta-nodes based on the institutional affiliation of the author. This provides even deeper insight into the patterns of collaboration.

Organization of Collaboration within a Company

The network in FIG. 27 is the result of a search of the MeSH category Diabetes Mellitus and the institution Joslin (institution is found as part of the address field in PubMed). Joslin is short for Joslin Diabetes Center, the world's leading diabetes center. The diagram identifies pockets of collaboration, that is, people within the organization who co-author documents on specific topics. The diagram displays in full text the names of journals and MeSH terms that appear on five or more documents in the search. This enables one to see popular research topics such as Diabetic retinopathy and Islets of Langerhans, as well as journals that this organization has published in since 2000 including Diabetes, Diabetologia, Transplantation, and Diabetes Care, among many others.

Targeting Key Opinion Leaders (KOLs) or Key Researchers in a Geography for Clinical Trials or Market Influence

In FIG. 28, the diabetes mellitus search has been limited by geography (Australia) instead of institution. Here, the network map is restricted to show only those authors who have written 15 or more documents. Cooper M E is an author whose name appears on 49 documents. Subject to further investigation, it is likely that Dr. Cooper is a key opinion leader in Australia whom a pharmaceutical or biotech company would want to target if it is marketing a diabetes drug.

Topic Clustering in a Particular Field (Which MeSH Categories Go Together)

FIG. 29 shows a network resulting from a search that is limited by organization to highlight that a user can also see related MeSH categories. Here all other nodes and links have been removed, and what is left reveals which MeSH categories the documents share. For instance, in this example documents with the MeSH category Diabetes Mellitus, Type II are also coded as Obesity, Islets of Langerhans, Blood Glucose, and Insulin, among many others.
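The kind of co-coding relationship shown in FIG. 29 can be illustrated with a short counting sketch. The records and the `mesh` field name below are hypothetical and do not reproduce the data behind the figure.

```python
from collections import Counter


def co_coded(records, category):
    """Count the other MeSH categories assigned to documents coded with `category`."""
    counts = Counter()
    for r in records:
        terms = set(r["mesh"])
        if category in terms:
            counts.update(terms - {category})
    return counts


# Hypothetical records, for illustration only.
records = [
    {"pmid": 1, "mesh": ["Diabetes Mellitus, Type II", "Obesity", "Insulin"]},
    {"pmid": 2, "mesh": ["Diabetes Mellitus, Type II", "Insulin"]},
    {"pmid": 3, "mesh": ["Obesity", "Blood Glucose"]},
]
related = co_coded(records, "Diabetes Mellitus, Type II")
```

The resulting counts could serve directly as meta-link weights between the selected MeSH meta-node and each related category.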

Research Synergies or Substitutions Across Organizations

The network shown in FIG. 30 results from a search on the MeSH category Cardiovascular Agents and three specific organizations. The following organizations and/or combinations of them have been highlighted using the Color Query feature: Pharmacia (yellow), Pfizer (green), Warner Lambert (blue), and the combination of Pfizer and Pharmacia (purple). This map enables a user to see which MeSH topics each organization's research falls under within a larger domain. This could help organizations think strategically about investment of resources in certain research projects, the competition in a particular research area, and/or emerging areas that they are not yet involved in.

Regional Bases of Research Strength in a Broader Geography

The MNVS can also help reveal research “hubs” across geographies. In FIG. 31, a search has been designed to highlight research under the MeSH category Diabetes Mellitus Type I in Massachusetts (color coded green), California (color coded pink), and North Carolina (color coded blue). A user could run a similar search without geographic constraints and explore the data to see what areas seem to emerge as “hubs.” Additionally, a user may be able to identify geographies that may be focused on a smaller niche within the broad domain (e.g. Autoantigens in this diabetes research example).

As demonstrated in these examples, all of the features of the NVS described above are relevant to the analysis of PubMed data including network transformation, use of multiple nodes and links, fractal networks, network animation, statistical information and linking to external data sources. Some elements of the preferred embodiment of the NVS that are specific to the analysis of PubMed data are further described below.

Statistical Information

Various types of statistical information are relevant for the analysis of PubMed data including:

-   -   Institutions: Number of articles in the selected network by institution sorted from highest to lowest.
    -   Authors: Number of articles in the selected network by author sorted from highest to lowest.
    -   Classification: Number of articles in the selected network by MeSH category sorted from highest to lowest or sorted by classification category. Since the MeSH classification scheme is hierarchical, the data is displayed using a tree structure with the number of articles within each category and subcategory displayed alongside each branch of the tree.
    -   Word usage: Number of articles containing key words, phrases or word groupings.
    -   Citation: Various types of information about citations can be provided. These include but are not limited to the following:
        -   Most frequently cited articles, institutions, authors, or other grouping.
        -   Highest number of citations per year since publication for articles, institutions, authors or other grouping.
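The hierarchical classification count described in the list above can be sketched as follows. The dotted category paths used here (`A`, `A.1`, and so on) are placeholders invented for the example rather than actual MeSH tree numbers, and the `mesh_paths` field name is likewise an assumption.

```python
from collections import defaultdict


def tree_counts(records):
    """Count distinct articles under each category path and each of its ancestors."""
    members = defaultdict(set)
    for r in records:
        for path in r["mesh_paths"]:
            parts = path.split(".")
            # Credit the article to the category and to every ancestor above it.
            for i in range(1, len(parts) + 1):
                members[".".join(parts[:i])].add(r["pmid"])
    return {path: len(ids) for path, ids in sorted(members.items())}


def format_tree(counts):
    """Render the counts as indented lines, one per branch of the tree."""
    return [
        "  " * path.count(".") + f"{path} ({n})" for path, n in counts.items()
    ]


# Hypothetical records with placeholder category paths.
records = [
    {"pmid": 1, "mesh_paths": ["A.1"]},
    {"pmid": 2, "mesh_paths": ["A.1", "A.2"]},
    {"pmid": 3, "mesh_paths": ["B"]},
]
counts = tree_counts(records)
lines = format_tree(counts)
```

Using a set of article identifiers per branch ensures that an article coded under two subcategories of the same parent is counted only once at the parent level.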

Statistical information is also provided in pop-up windows. Several examples of the kind of information provided in the pop-ups that is specifically related to PubMed data are described below:

-   -   Article node pop-ups: When an article node is selected, a pop-up window can be called up which displays information about the article that is represented by the selected node. The information provided includes all of the basic information provided on the summary page of a typical article including PubMed ID number, title, authors, institutions, publication dates, MeSH classes, citations, as well as other data from the article like number of pages, number of figures, number of words, etc. In addition, many of the fields in the pop-up window are hyperlinked, allowing the user to pull up additional information. For example, the article number is hyperlinked to the full text of the article (or a .pdf), citation links are hyperlinks, as well as other hyperlinks. The pop-up also can include statistical information about the article like centrality.
    -   Institution meta-nodes: When an institution meta-node is selected, a pop-up window can be called up which displays a menu of different kinds of data that can be displayed about the institution and the articles represented by the meta-node. Menu options include tables showing a list of the articles represented by the meta-node, articles by MeSH category, articles by author, and a graph showing articles by year. Additional menu options include network statistical information that can be displayed about institution meta-nodes including total citations, average citations per year (since year of publication), and the sum of eigenvector centrality for the institution's portfolio/the sum of eigenvector centrality for the entire network (a measure of portfolio value). Another menu option provides information about the institution. This menu option links to the institution's website or to basic company and financial information about the company.
    -   Author meta-nodes: When an author meta-node is selected, a pop-up window can be called up which displays a menu of different kinds of data that can be displayed about the author and the articles represented by the meta-node. Menu options include tables showing a list of the articles represented by the meta-node, articles by co-author, articles by MeSH class, and a graph showing articles by year.
    -   MeSH meta-nodes: When a MeSH meta-node is selected, a pop-up window can be called up which displays a menu of different kinds of data that can be displayed about the MeSH category and the articles represented by the meta-node. Menu options include tables showing a list of the articles represented by the meta-node, articles by institution, articles by author, and a graph showing articles by year. Another menu option provides information about the MeSH category. This menu option provides detailed information about the MeSH category including a full description of the class and its location in the MeSH hierarchy.
    -   Meta-links: When a meta-link is selected, a pop-up window can be called up which displays information about the connections represented by the meta-link. A table can be displayed showing a list of article to article links represented, as well as a graph of the number of individual links represented by the meta-link over time. If, for example, the meta-link is a co-authorship link, the meta-link pop-up will show a history of the collaboration between the two authors. If the meta-link is an institution-institution citation link, the pop-up will show a history of citations between the two institutions.
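The collaboration history shown in a co-authorship meta-link pop-up could, for instance, be derived by counting jointly authored articles per year. The following is a minimal sketch under that assumption, with hypothetical field names and records.

```python
from collections import Counter


def collaboration_history(records, author_a, author_b):
    """Number of articles the two authors share, keyed by publication year."""
    years = Counter(
        r["year"]
        for r in records
        if author_a in r["authors"] and author_b in r["authors"]
    )
    return dict(sorted(years.items()))


# Hypothetical records, for illustration only.
records = [
    {"pmid": 1, "year": 2001, "authors": ["Ann", "Bob"]},
    {"pmid": 2, "year": 2003, "authors": ["Ann", "Bob", "Cy"]},
    {"pmid": 3, "year": 2003, "authors": ["Bob", "Ann"]},
    {"pmid": 4, "year": 2004, "authors": ["Ann", "Cy"]},
]
history = collaboration_history(records, "Ann", "Bob")
```

The returned mapping of year to link count is the series that would be plotted in the pop-up's graph of individual links over time.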

All of this pop-up information makes it possible for the user to explore the article network at any level of detail desired, from high level meta-data down to the deepest level of detail about institutions, authors, fields of study and articles. This makes the MNVS a powerful tool for understanding a large set of PubMed documents.
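The history shown in a meta-link pop-up, that is, the number of individual article-to-article links represented by the meta-link over time, can be sketched as a simple per-year count. The function name, the (source, target, year) link tuples and the `group_of` mapping from article ID to meta-node are hypothetical names chosen for this example.

```python
from collections import Counter


def meta_link_history(article_links, group_of, group_a, group_b):
    """Count, per year, the article-to-article links between two meta-nodes.

    article_links: iterable of (source_id, target_id, year) tuples
    group_of:      dict mapping an article ID to its meta-node
                   (for example an institution or an author)
    """
    counts = Counter()
    for source, target, year in article_links:
        # A link belongs to the meta-link when its two endpoints fall
        # in the two meta-nodes, in either direction.
        if {group_of[source], group_of[target]} == {group_a, group_b}:
            counts[year] += 1
    return dict(sorted(counts.items()))
```

For an institution-institution citation meta-link, the returned mapping of year to link count is the series that the pop-up graph would plot.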

Linking to External Data Sources

Some examples of useful exogenous data sources specifically related to medical data as well as their use within the MNVS are described below.

-   -   Doctor affiliation data—One valuable source of external data to
        link to the PubMed database is information about the
        affiliation of the researchers. Most authors in PubMed are
        doctors, and it is possible to link to information about those
        doctors in both public and proprietary databases. These
        databases contain information like medical specialty, hospital
        privileges, DEA#, medical school attended, residency programs
        completed, medical association membership, etc. Linking to this
        data makes it possible to create whole new categories of
        meta-nodes as well as new types of links that cannot be made
        through the PubMed data alone.
    -   Script data—Another highly valuable source of exogenous data is
        Script data. Proprietary databases such as IMS maintain
        information about the prescribing patterns of doctors. They
        calculate the number of prescriptions that doctors write for
        each and every drug they prescribe. This data is incredibly
        valuable as a source of information to biotechnology and
        pharmaceutical companies to determine which doctors are the most
        important to reach from a marketing standpoint. When combined
        with the MNVS, the tool enables biotech and pharma companies to
        identify key opinion leaders (KOLs) that are most closely
        connected to the largest number of prescribers of the
        medications in the therapeutic area of interest. By targeting
        these KOLs, the companies can influence the prescription
        patterns of the doctors and capture market share.
    -   Referral data—Proprietary databases like LRX also provide a
        source of valuable external data to link to. The LRX database
        captures doctor referral information which can be used to create
        a social network of medical relationships within and across
        specialties.
    -   Survey data—Companies like Alpha Detail conduct surveys of many
        thousands of doctors to determine what information they read,
        and what other doctors they are influenced by. This data is
        valuable as another source of exogenous data particularly in
        addressing the marketing questions of life-sciences companies.
    -   Grant data—Often the step before medical literature is published
        is the submission and approval of a grant. In the US, the vast
        majority of these grants come through the National Institutes of
        Health. The NIH maintains a database known as the Computer
        Retrieval of Information on Scientific Projects (CRISP). This
        database has information about all NIH funded research projects.
        Linking to this data makes it possible to track innovative
        research even earlier than the first medical article
        publication. Other countries also maintain similar databases.
    -   FDA trial data—The FDA maintains information in public databases
        about the various drug candidates that are in various phases of
        the FDA approval process. By linking to this data, it is
        possible to analyze how medical research feeds into the drug
        pipeline and assess the position of the various drugs and drug
        companies.
    -   FDA product data—At the other end of the time scale are the FDA
        product databases. The goal of most medical research is to
        develop a treatment for some disease which in most cases must be
        approved by the FDA (within the U.S.). The FDA maintains the
        DRUG database and many other databases that provide information
        about over-the-counter and prescription drugs as well as food
        supplements and many other health related products. By linking
        to this data, it is possible to track the output of the research
        contained in the PubMed database.
    -   Institutional data—Institutional data is another source of
        exogenous data that can be incorporated into the MNVS. Links can
        be made to institution or company data by way of the institution
        field in the PubMed database or indirectly through a database of
        institutional affiliations held by the doctor/author. Various
        types of institutional/corporate data exist from a variety of
        sources. Linking to this data makes it possible to analyze more
        deeply the role that companies, universities, government
        entities and research institutions play within an area of
        research interest.
    -   Patent data—Links to the patent data are also critically
        important. The patent data represents the portions of medical
        research that have been converted to protectable intellectual
        property rights.
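To illustrate how linking to an external source can yield a new category of meta-node, the sketch below groups articles by the medical specialty of their authors, using a doctor affiliation table of the kind described in the first item above. The function name, the record keys ("id", "authors", "name", "specialty") and the matching of authors to doctors by exact name are assumptions made for the example; a practical system would likely need more robust identity matching.

```python
from collections import defaultdict


def specialty_meta_nodes(articles, doctor_records):
    """Group article IDs into meta-nodes keyed by author specialty.

    articles:       list of dicts with hypothetical keys "id" and "authors"
    doctor_records: list of dicts with hypothetical keys "name" and
                    "specialty", standing in for an external affiliation
                    database
    """
    # Exact-name lookup is a simplifying assumption for this sketch.
    specialty_of = {d["name"]: d["specialty"] for d in doctor_records}

    meta_nodes = defaultdict(set)
    for article in articles:
        for author in article["authors"]:
            specialty = specialty_of.get(author)
            if specialty is not None:
                # The article joins the meta-node for each specialty
                # represented among its authors.
                meta_nodes[specialty].add(article["id"])
    return dict(meta_nodes)
```

Articles whose authors have no match in the external table are simply left out of the specialty meta-nodes rather than being assigned a guessed specialty.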

Combining the Patent and Medical NVS

Although the patent and PubMed embodiments have been described separately, the NVS is also capable of combining these two data sources. Links between the two data sources come in a variety of forms, including citations from patents to academic literature, links between article authors and inventors, and links between companies/institutions. By linking the medical research data with patent data as well as grant, FDA and script data, it is possible to get a picture of the entire lifecycle of an idea from inception all the way through product approval and marketing.
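Two of the link types named above, patent-to-article citations and inventor-author matches, can be sketched as follows. The function name, the record keys ("id", "inventors", "cited_articles", "authors"), the link labels, and the use of exact name equality to match an inventor to an author are all assumptions made for the example.

```python
def cross_links(patents, articles):
    """Build links between a patent data set and an article data set.

    patents:  list of dicts with hypothetical keys "id", "inventors",
              "cited_articles"
    articles: list of dicts with hypothetical keys "id", "authors"
    Returns a list of (patent_id, article_id, link_type) tuples.
    """
    article_ids = {a["id"] for a in articles}

    # Index articles by author name so inventors can be looked up.
    articles_by_author = {}
    for article in articles:
        for name in article["authors"]:
            articles_by_author.setdefault(name, []).append(article["id"])

    links = []
    for patent in patents:
        # Citation links: the patent cites an article in the data set.
        for cited in patent["cited_articles"]:
            if cited in article_ids:
                links.append((patent["id"], cited, "citation"))
        # Person links: an inventor's name matches an article author.
        for inventor in patent["inventors"]:
            for article_id in articles_by_author.get(inventor, []):
                links.append((patent["id"], article_id, "inventor-author"))
    return links
```

Company/institution links could be added in the same way by indexing articles on their institution field and matching against a patent's assignee.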

The NVS enables a much deeper understanding of the nature of scientific and technological development than has ever been possible before. Many different kinds of questions can be answered that have been unanswerable by any means known in the prior art. Many of those questions have very high value both economically, and for the good of society.

OTHER EMBODIMENTS

It should be obvious to one skilled in the art that the NVS can be applied in a broad array of contexts. The embodiments described demonstrate the applicability of the NVS to two different data sources. Application to many other sources is possible in similar ways to those presented in the described embodiments.

Computer Implementation

The method of analyzing database records in accordance with the various embodiments of the invention is preferably implemented in a general-purpose computer 300, as shown in FIG. 32. A representative computer 300 is a personal computer or workstation platform that is, e.g., Intel Pentium®, PowerPC® or RISC based, and includes an operating system such as Windows®, Linux®, OS/2®, Unix or the like. As is well known, such machines include a display interface 302 (a graphical user interface or “GUI”) and associated input devices 304 (e.g., a keyboard or mouse).

The database records analysis method is preferably implemented in software, and accordingly one of the embodiments is as a set of instructions 306 (e.g., program code) in a code module resident in a computer-readable medium such as random access memory 308 of the computer 300. Until required by the computer 300, the set of instructions 306 may be stored in another computer-readable medium 310, e.g., in a hard disk drive, or in a removable memory such as an optical disk (for eventual use in a CD ROM) or floppy disk (for eventual use in a floppy disk drive), or downloaded via the Internet or some other computer network. In addition, although the various methods described are conveniently implemented in a general-purpose computer 300 selectively activated or reconfigured by software, one of ordinary skill in the art would also recognize that such methods may be carried out in hardware, in firmware, or in more specialized apparatus constructed to perform the specified method steps.

Other aspects, modifications, and embodiments are within the scope of the claims.

1. A method of providing a network graphical representation of two or more database records, comprising: selecting the two or more database records according to one or more descriptive criteria, identifying one or more attributes of the record class, and associating network nodes to instances of the one or more attributes from the database records; and connecting the network nodes with network links that designate network nodes having common instances of the one or more attributes.